nep-big New Economics Papers
on Big Data
Issue of 2022‒04‒11
sixteen papers chosen by
Tom Coupé
University of Canterbury

  1. Time and the Value of Data By Ehsan Valavi; Joel Hestness; Newsha Ardalani; Marco Iansiti
  2. Macroeconomic Predictions Using Payments Data and Machine Learning By James Chapman; Ajit Desai
  3. Deep Regression Ensembles By Antoine Didisheim; Bryan T. Kelly; Semyon Malamud
  4. Using Past Violence and Current News to Predict Changes in Violence By Mueller, H.; Rauh, C.
  5. On the Use of Satellite-Based Vehicle Flows Data to Assess Local Economic Activity: The Case of Philippine Cities By Go, Eugenia; Nakajima, Kentaro; Sawada, Yasuyuki; Taniguchi, Kiyoshi
  6. Informal Loans in Thailand: Stylized Facts and Empirical Analysis By Pim Pinitjitsamut; Wisarut Suwanprasert
  7. Computing Black Scholes with Uncertain Volatility-A Machine Learning Approach By Kathrin Hellmuth; Christian Klingenberg
  8. Solving Multi-Period Financial Planning Models: Combining Monte Carlo Tree Search and Neural Networks By Afşar Onat Aydınhan; Xiaoyue Li; John M. Mulvey
  9. Artificial Intelligence and Auction Design By Martino Banchio; Andrzej Skrzypacz
  10. Games of Artificial Intelligence: A Continuous-Time Approach By Martino Banchio; Giacomo Mantegazza
  11. Using big data for generating firm-level innovation indicators: A literature review By Rammer, Christian; Es-Sadki, Nordine
  12. Media Slant is Contagious By Philine Widmer; Sergio Galletta; Elliott Ash
  13. Employment Outcomes for Social Security Disability Insurance Applicants Who Use Opioids By April Yanyuan Wu; Denise Hoffman; Paul O'Leary; Dara Lee Luca
  14. You reap what (you think) you sow? Evidence on farmers’ behavioral adjustments in the case of correct crop varietal identification By Paola Mallia
  15. Hidden hazards and Screening Policy: Predicting Undetected Lead Exposure in Illinois Using Machine Learning By Abbasi, Ali; Gazze, Ludovica; Pals, Bridget
  16. Change from the COVID-19 pandemic to a New Normal: Two Years of Documenting Consumption Behavior with Big Data (Japanese) By KONISHI Yoko; SAITO Takashi; KANAI Hajime; IGEI Naoya; MIZUMURA Junichi; SHIGA Kyoko; SUEYASU Keita; HAMAGUCHI Ryosuke

  1. By: Ehsan Valavi; Joel Hestness; Newsha Ardalani; Marco Iansiti
    Abstract: Managers often believe that collecting more data will continually improve the accuracy of their machine learning models. However, we argue in this paper that when data lose relevance over time, it may be optimal to collect a limited amount of recent data instead of keeping around an infinite supply of older (less relevant) data. In addition, we argue that increasing the stock of data by including older datasets may, in fact, damage the model's accuracy. Expectedly, the model's accuracy improves by increasing the flow of data (defined as data collection rate); however, it requires other tradeoffs in terms of refreshing or retraining machine learning models more frequently. Using these results, we investigate how the business value created by machine learning models scales with data and when the stock of data establishes a sustainable competitive advantage. We argue that data's time-dependency weakens the barrier to entry that the stock of data creates. As a result, a competing firm equipped with a limited (yet sufficient) amount of recent data can develop more accurate models. This result, coupled with the fact that older datasets may deteriorate models' accuracy, suggests that created business value doesn't scale with the stock of available data unless the firm offloads less relevant data from its data repository. Consequently, a firm's growth policy should incorporate a balance between the stock of historical data and the flow of new data. We complement our theoretical results with an experiment. In the experiment, we empirically measure the loss in the accuracy of a next word prediction model trained on datasets from various time periods. Our empirical measurements confirm the economic significance of the value decline over time. For example, 100MB of text data, after seven years, becomes as valuable as 50MB of current data for the next word prediction task.
    Date: 2022–03
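    Illustration: the closing example in this abstract (100MB of seven-year-old text being worth 50MB of current text) can be read as a half-life. The sketch below assumes the effective size of a dataset decays exponentially with age; that functional form is an assumption made here for illustration, not something the abstract states, and the function names are hypothetical.

```python
import math

def half_life_years(initial_mb, equivalent_mb, elapsed_years):
    """Half-life implied by equivalent = initial * 0.5 ** (elapsed / half_life)."""
    ratio = equivalent_mb / initial_mb
    return elapsed_years * math.log(0.5) / math.log(ratio)

def effective_size_mb(initial_mb, age_years, half_life):
    """Effective size of a dataset of a given age under the assumed exponential decay."""
    return initial_mb * 0.5 ** (age_years / half_life)

# Figures taken from the abstract: 100MB is worth 50MB after seven years.
h = half_life_years(100.0, 50.0, 7.0)

# Extrapolation under the exponential assumption only: the same data at 14 years.
older = effective_size_mb(100.0, 14.0, h)
print(f"implied half-life: {h:.1f} years, 100MB at 14 years is worth about {older:.0f}MB")
```

    Under this assumed decay, a stock of old data keeps shrinking in effective terms unless the flow of new data replaces it, which is the stock-versus-flow balance the abstract describes.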
  2. By: James Chapman; Ajit Desai
    Abstract: Predicting the economy’s short-term dynamics—a vital input to economic agents’ decision-making process—often uses lagged indicators in linear models. This is typically sufficient during normal times but could prove inadequate during crisis periods such as COVID-19. This paper demonstrates: (a) that payments systems data which capture a variety of economic transactions can assist in estimating the state of the economy in real time and (b) that machine learning can provide a set of econometric tools to effectively handle a wide variety in payments data and capture sudden and large effects from a crisis. Further, we mitigate the interpretability and overfitting challenges of machine learning models by using the Shapley value-based approach to quantify the marginal contribution of payments data and by devising a novel cross-validation strategy tailored to macroeconomic prediction models.
    Keywords: Business fluctuations and cycles; Econometric and statistical methods; Payment clearing and settlement systems
    JEL: C53 C55 E37 E42 E52
    Date: 2022–03
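    Illustration: the Shapley value-based approach mentioned in this abstract attributes a model's performance to each input by averaging that input's marginal contribution over all orderings. The sketch below computes exact Shapley values for a two-feature toy case. The feature names and the value table are made up for illustration and do not come from the paper.

```python
from itertools import permutations

def shapley_values(features, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        coalition = frozenset()
        for f in order:
            with_f = coalition | {f}
            totals[f] += value(with_f) - value(coalition)
            coalition = with_f
    return {f: t / len(orderings) for f, t in totals.items()}

# Hypothetical predictive gain for each subset of inputs (made-up numbers).
TOY_VALUE = {
    frozenset(): 0.0,
    frozenset({"payments"}): 0.5,
    frozenset({"lagged_gdp"}): 0.3,
    frozenset({"payments", "lagged_gdp"}): 0.6,
}

phi = shapley_values(["payments", "lagged_gdp"], TOY_VALUE.__getitem__)
print(phi)
```

    The attributions always sum to the value of the full feature set, which is what makes the decomposition usable as a measure of marginal contribution. Exact enumeration grows factorially with the number of features, so practical implementations rely on approximations.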
  3. By: Antoine Didisheim (Swiss Finance Institute, UNIL); Bryan T. Kelly (Yale SOM; AQR Capital Management, LLC; National Bureau of Economic Research (NBER)); Semyon Malamud (Ecole Polytechnique Federale de Lausanne; Centre for Economic Policy Research (CEPR); Swiss Finance Institute)
    Abstract: We introduce a methodology for designing and training deep neural networks (DNN) that we call “Deep Regression Ensembles” (DRE). It bridges the gap between DNN and two-layer neural networks trained with random feature regression. Each layer of DRE has two components: randomly drawn input weights, and output weights trained myopically (as if the final output layer) using linear ridge regression. Within a layer, each neuron uses a different subset of inputs and a different ridge penalty, constituting an ensemble of random feature ridge regressions. Our experiments show that a single DRE architecture is on par with or exceeds state-of-the-art DNN in many data sets. Yet, because DRE neural weights are either known in closed-form or randomly drawn, its computational cost is orders of magnitude smaller than DNN.
    Keywords: Deep learning, Neural network, Random features, Ensembles
    Date: 2022–03
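    Illustration: the layer construction described in this abstract can be sketched in plain Python. Each neuron draws a random subset of inputs and random input weights, then fits its output weights by closed-form ridge regression against the target, as if it were the final layer. The tanh activation, Gaussian weights, layer sizes, penalties, toy data, and the final averaging step are all assumptions made here; this is not the authors' implementation.

```python
import math
import random

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def ridge_fit(feats, y, penalty):
    """Closed-form ridge: beta = (F'F + penalty * I)^-1 F'y."""
    p = len(feats[0])
    gram = [[sum(f[i] * f[j] for f in feats) + (penalty if i == j else 0.0)
             for j in range(p)] for i in range(p)]
    rhs = [sum(f[i] * t for f, t in zip(feats, y)) for i in range(p)]
    return solve(gram, rhs)

def fit_neuron(x, y, rng, n_inputs, n_features, penalty):
    """One neuron: random input subset, random weights, ridge-trained output weights."""
    subset = rng.sample(range(len(x[0])), n_inputs)
    w = [[rng.gauss(0.0, 1.0) for _ in subset] for _ in range(n_features)]

    def features(row):
        return [math.tanh(sum(wk * row[i] for wk, i in zip(wj, subset))) for wj in w]

    beta = ridge_fit([features(r) for r in x], y, penalty)
    return lambda row: sum(b * f for b, f in zip(beta, features(row)))

def fit_layer(x, y, rng, penalties, n_inputs, n_features=8):
    """A layer is an ensemble of neurons, each with its own subset and penalty."""
    neurons = [fit_neuron(x, y, rng, n_inputs, n_features, p) for p in penalties]
    return lambda row: [n(row) for n in neurons]

rng = random.Random(0)
x = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(200)]   # toy inputs
y = [math.sin(2 * r[0]) + 0.5 * r[1] for r in x]                   # toy target
layer1 = fit_layer(x, y, rng, penalties=[0.01, 0.1, 1.0, 10.0], n_inputs=2)
h1 = [layer1(r) for r in x]                      # layer-1 outputs feed layer 2
layer2 = fit_layer(h1, y, rng, penalties=[0.01, 0.1, 1.0], n_inputs=3)
pred = [sum(out) / len(out) for out in (layer2(layer1(r)) for r in x)]
mse = sum((p - t) ** 2 for p, t in zip(pred, y)) / len(y)
print(f"in-sample MSE of the two-layer sketch: {mse:.4f}")
```

    No gradient descent appears anywhere: every trained weight comes from one linear solve, which is the source of the cost advantage the abstract claims.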
  4. By: Mueller, H.; Rauh, C.
    Abstract: This article proposes a new method for predicting escalations and de-escalations of violence using a model which relies on conflict history and text features. The text features are generated from over 3.5 million newspaper articles using a so-called topic model. We show that the combined model relies to a large extent on conflict dynamics, but that text is able to contribute meaningfully to the prediction of rare outbreaks of violence in previously peaceful countries. Given the very powerful dynamics of the conflict trap these cases are particularly important for prevention efforts.
    Keywords: Conflict, prediction, machine learning, LDA, topic model, battle deaths, ViEWS prediction competition, random forest
    JEL: F21 C53 C55
    Date: 2022–03–22
  5. By: Go, Eugenia (Asian Development Bank); Nakajima, Kentaro (Hitotsubashi University); Sawada, Yasuyuki (University of Tokyo); Taniguchi, Kiyoshi (Asian Development Bank)
    Abstract: The lack of suitable data is a key challenge in ex-post policy evaluations. This paper proposes a novel data source for measuring local economic activity: vehicle counts in each 500 meter (m) x 500 m tile. The metric is derived from high resolution satellite images using a machine learning algorithm. Using the opening of the new international airport terminal in Cebu, Philippines, as a quasi-experiment, we estimate the impact of the new infrastructure on the local economy of Metro Cebu. Results of the difference-in-differences analysis show that the new terminal significantly increased vehicle traffic in urban Cebu. The effect decays with distance from the airport, is stronger in areas where hotels are located, and is most pronounced in the peak months for international tourists. These findings imply that the opening of the new international terminal has enhanced Cebu's local economy through international tourism.
    Keywords: transportation infrastructure; satellite imagery data
    JEL: R11
    Date: 2022–03–14
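    Illustration: the difference-in-differences logic in this abstract compares the change in vehicle counts in affected tiles with the change in comparison tiles over the same period. The sketch below shows the basic two-group, two-period estimate. The tile counts and the near/far grouping are made up for illustration; the paper's actual specification is not described in the abstract.

```python
def mean(values):
    return sum(values) / len(values)

def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """2x2 DiD: change in treated tiles minus change in control tiles."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change

# Made-up vehicle counts per tile, before and after the terminal opening.
near_pre, near_post = [10, 12, 14], [16, 18, 20]   # tiles near the airport
far_pre, far_post = [8, 9, 10], [9, 10, 11]        # tiles far from the airport

effect = diff_in_diff(near_pre, near_post, far_pre, far_post)
print(f"estimated effect: {effect:.1f} vehicles per tile")
```

    Subtracting the change in the comparison tiles removes any trend common to both groups, so the estimate is credible only if the two groups would otherwise have moved in parallel.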
  6. By: Pim Pinitjitsamut; Wisarut Suwanprasert
    Abstract: This paper examines informal loans in Thailand using household survey data covering 4,800 individuals in 12 provinces across Thailand’s six regions. We proceed in three steps. First, we establish stylized facts about informal loans. Second, we estimate the effects of household characteristics on the decision to take out an informal loan and the amount of informal loan. We find that age, the number of household members, their savings, and the amount of existing formal loans are the main factors that drive the decision to take out an informal loan. The main determinants of the amount of informal loan are the interest rate, savings, the amount of existing formal loans, the number of household members, and personal income. Third, we train three machine learning models, namely K–Nearest Neighbors, Random Forest, and Gradient Boosting, to predict whether an individual will take out an informal loan and the amount an individual has borrowed through informal loans. We find that the Gradient Boosting technique with the top 15 most important features has the highest prediction rate of 76.46 percent, making it the best model for data classification. Generally, Random Forest outperforms the other two algorithms in both classifying data and predicting the amount of informal loans.
    Keywords: Informal Loans; Machine Learning; Shadow Economy; Thailand; Loan Sharks
    JEL: E26 G51 O16 O17
    Date: 2022–02
  7. By: Kathrin Hellmuth; Christian Klingenberg
    Abstract: In financial mathematics, it is a typical approach to approximate financial markets operating in discrete time by continuous-time models such as the Black Scholes model. Fitting this model gives rise to difficulties due to the discrete nature of market data. We thus model the pricing process of financial derivatives by the Black Scholes equation, where the volatility is a function of a finite number of random variables. This reflects an influence of uncertain factors when determining volatility. The aim is to quantify the effect of this uncertainty when computing the price of derivatives. Our underlying method is the generalized Polynomial Chaos (gPC) method in order to numerically compute the uncertainty of the solution by the stochastic Galerkin approach and a finite difference method. We present an efficient numerical variation of this method, which is based on a machine learning technique, the so-called Bi-Fidelity approach. This is illustrated with numerical examples.
    Date: 2022–02
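    Illustration: the question posed in this abstract is how uncertainty in volatility propagates to the derivative price. The sketch below is a deliberately simpler stand-in for the paper's method: it uses the closed-form Black-Scholes call price and midpoint quadrature over a single uniformly distributed volatility. It does not implement generalized Polynomial Chaos, stochastic Galerkin, or the Bi-Fidelity approach, and the parameter values are assumptions.

```python
import math

def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def bs_call(spot, strike, rate, sigma, maturity):
    """Black-Scholes price of a European call option."""
    vol_term = sigma * math.sqrt(maturity)
    d1 = (math.log(spot / strike) + (rate + 0.5 * sigma ** 2) * maturity) / vol_term
    d2 = d1 - vol_term
    return spot * norm_cdf(d1) - strike * math.exp(-rate * maturity) * norm_cdf(d2)

def price_moments(sigma_low, sigma_high, n=1000, **contract):
    """Mean and std of the price when sigma ~ Uniform(low, high), by midpoint rule."""
    step = (sigma_high - sigma_low) / n
    prices = [bs_call(sigma=sigma_low + (k + 0.5) * step, **contract) for k in range(n)]
    mean = sum(prices) / n
    var = sum((p - mean) ** 2 for p in prices) / n
    return mean, math.sqrt(var)

# Hypothetical contract: at-the-money call, one year to maturity.
contract = dict(spot=100.0, strike=100.0, rate=0.01, maturity=1.0)
mean_price, std_price = price_moments(0.15, 0.25, **contract)
print(f"expected price {mean_price:.2f}, standard deviation {std_price:.2f}")
```

    Brute-force evaluation like this becomes expensive when volatility depends on several random variables or when the price must come from a PDE solve, which is the setting where the methods named in the abstract are intended to help.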
  8. By: Afşar Onat Aydınhan; Xiaoyue Li; John M. Mulvey
    Abstract: This paper introduces the MCTS algorithm to the financial world and focuses on solving significant multi-period financial planning models by combining a Monte Carlo Tree Search algorithm with a deep neural network. The MCTS provides an advanced start for the neural network so that the combined method outperforms either approach alone, yielding competitive results. Several innovations improve the computations, including a variant of the upper confidence bound applied to trees (UCT) and a special lookup search. We compare the two-step algorithm with employing dynamic programs/neural networks. Both approaches solve regime switching models with 50 time steps and transaction costs with twelve asset categories. Heretofore, these problems have been outside the range of solvable optimization models via traditional algorithms.
    Date: 2022–02
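    Illustration: the upper confidence bound applied to trees is the selection rule at the core of Monte Carlo Tree Search. The sketch below shows the textbook form of the rule, not the variant the authors develop. The exploration constant, the tuple layout, and the asset names are assumptions.

```python
import math

def uct_score(value_sum, visits, parent_visits, c=math.sqrt(2.0)):
    """Textbook UCT score: mean reward plus an exploration bonus."""
    if visits == 0:
        return math.inf  # unvisited children are always tried first
    exploit = value_sum / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore

def select_child(children, parent_visits):
    """children: list of (name, value_sum, visits); returns the name to expand next."""
    return max(children, key=lambda ch: uct_score(ch[1], ch[2], parent_visits))[0]

# Hypothetical allocation choices at one node of a planning tree.
children = [("stocks", 6.0, 10), ("bonds", 2.5, 4), ("cash", 0.0, 0)]

first = select_child(children, parent_visits=14)       # the unvisited child wins
second = select_child(children[:2], parent_visits=14)  # fewer visits earn a larger bonus
print(first, second)
```

    The bonus shrinks as a child accumulates visits, so the search gradually shifts from exploring rarely tried branches to exploiting the ones with the best average reward.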
  9. By: Martino Banchio; Andrzej Skrzypacz
    Abstract: Motivated by online advertising auctions, we study auction design in repeated auctions played by simple Artificial Intelligence algorithms (Q-learning). We find that first-price auctions with no additional feedback lead to tacit-collusive outcomes (bids lower than values), while second-price auctions do not. We show that the difference is driven by the incentive in first-price auctions to outbid opponents by just one bid increment. This facilitates re-coordination on low bids after a phase of experimentation. We also show that providing information about the lowest bid to win, as introduced by Google at the time of its switch to first-price auctions, increases the competitiveness of auctions.
    Date: 2022–02
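    Illustration: the incentive this abstract points to, outbidding the opponent by just one increment in a first-price auction, can be checked with a static best-response calculation on a discrete bid grid. The sketch below does only that. It does not simulate the Q-learning dynamics studied in the paper, and the grid, the bidder's value, and the tie-splitting rule are assumptions.

```python
def first_price_payoff(bid, rival_bid, value):
    """Winner pays own bid; a tie is won with probability one half."""
    if bid > rival_bid:
        return value - bid
    if bid == rival_bid:
        return 0.5 * (value - bid)
    return 0.0

def second_price_payoff(bid, rival_bid, value):
    """Winner pays the rival's bid; a tie is won with probability one half."""
    if bid > rival_bid:
        return value - rival_bid
    if bid == rival_bid:
        return 0.5 * (value - rival_bid)
    return 0.0

def best_responses(payoff, rival_bid, value, grid):
    """All bids on the grid that maximize expected payoff against a fixed rival bid."""
    best = max(payoff(b, rival_bid, value) for b in grid)
    return [b for b in grid if payoff(b, rival_bid, value) == best]

GRID = list(range(0, 11))   # hypothetical bid grid: 0..10 in increments of 1
VALUE = 10                  # hypothetical value of the item to the bidder

fp = best_responses(first_price_payoff, 3, VALUE, GRID)
sp = best_responses(second_price_payoff, 3, VALUE, GRID)
print("first-price best response to a rival bid of 3:", fp)
print("second-price best responses to a rival bid of 3:", sp)
```

    In the first-price format the unique best response sits exactly one increment above the rival's bid, so low bids by one side invite low bids by the other. In the second-price format bidding one's value is always among the best responses, which removes that pull toward low bids.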
  10. By: Martino Banchio; Giacomo Mantegazza
    Abstract: This paper studies the strategic interaction of algorithms in economic games. We analyze games where learning algorithms play against each other while searching for the best strategy. We first establish a fluid approximation technique that enables us to characterize the learning outcomes in continuous time. This tool allows us to identify the equilibria of games played by Artificial Intelligence algorithms and perform comparative statics analysis. Thus, our results bridge a gap between traditional learning theory and applied models, allowing quantitative analysis of traditionally experimental systems. We describe the outcomes of a social dilemma, and we provide analytical guidance for the design of pricing algorithms in a Bertrand game. We uncover a new phenomenon, the coordination bias, which explains how algorithms may fail to learn dominant strategies.
    Date: 2022–02
  11. By: Rammer, Christian; Es-Sadki, Nordine
    Abstract: Obtaining indicators on innovation activities of firms has been a challenge in economic research for a long time. The most frequently used indicators - R&D expenditure and patents - provide an incomplete picture as they represent inputs and throughputs in the innovation process. Output measurement of innovation has relied strongly on survey data such as the Community Innovation Survey (CIS), but suffers from several shortcomings typical of sample surveys, including incomplete coverage of the firm sector, low timeliness and limited comparability across industries and firms. The availability of big data sources has initiated new efforts to collect innovation data at the firm level. This paper discusses recent attempts at using digital big data sources on firms, including websites and social media, to generate firm-level innovation indicators. It summarises the main challenges of using big data and proposes avenues for future research.
    Keywords: Big data, innovation indicators, CIS, literature review
    JEL: O30 C81
    Date: 2022
  12. By: Philine Widmer; Sergio Galletta; Elliott Ash
    Abstract: This paper analyzes the influence of partisan content from national cable TV news on local reporting in U.S. newspapers. We provide a new machine-learning-based measure of cable news slant, trained on a corpus of 40K transcribed TV episodes from Fox News Channel (FNC), CNN, and MSNBC (2005-2008). Applying the method to a corpus of 24M local newspaper articles, we find that in response to an exogenous increase in local viewership of FNC relative to CNN/MSNBC, local newspaper articles become more similar to FNC transcripts (and vice versa). Consistent with newspapers responding to changes in reader preferences, we see a shift in the framing of local news coverage rather than just direct borrowing of cable news content. Further, cable news slant polarizes local news content: right-leaning newspapers tend to adopt right-wing FNC language, while left-leaning newspapers tend to become more left-wing. Media slant is contagious.
    Date: 2022–02
  13. By: April Yanyuan Wu; Denise Hoffman; Paul O'Leary; Dara Lee Luca
    Abstract: In this paper, we examine the relationship between self-reported opioid use and employment outcomes among Social Security Disability Insurance (SSDI) applicants. We followed a sample of 2009 applicants to SSDI for four years after the Social Security Administration (SSA) determined their application outcome. We drew our sample from SSA’s Structured Data Repository (SDR) and supplemented the SDR with other SSA administrative data sources that provide information on application outcomes, annual earnings, and deaths. We used a machine-learning method to identify opioids in medication text fields in SDR data. Our analysis addresses two questions: (1) How do employment and earnings patterns differ between SSDI applicants who did and did not use opioids at the time of application? and (2) What is the association between opioid use and employment outcomes among SSDI applicants? We estimated the association between opioid use at application and later employment outcomes through ordinary least squares regression, by using three measures of local opioid availability as instrumental variables and by a reduced-form ordinary least squares regression. Understanding these patterns and associations can improve understanding about the post-application economic well-being of SSDI applicants and may help policymakers identify ways to help this group.
    Date: 2022–02
  14. By: Paola Mallia (PSE - Paris School of Economics - ENPC - École des Ponts ParisTech - ENS Paris - École normale supérieure - Paris - PSL - Université Paris sciences et lettres - UP1 - Université Paris 1 Panthéon-Sorbonne - CNRS - Centre National de la Recherche Scientifique - EHESS - École des hautes études en sciences sociales - INRAE - Institut National de Recherche pour l’Agriculture, l’Alimentation et l’Environnement, PJSE - Paris Jourdan Sciences Economiques - UP1 - Université Paris 1 Panthéon-Sorbonne - ENS Paris - École normale supérieure - Paris - PSL - Université Paris sciences et lettres - EHESS - École des hautes études en sciences sociales - ENPC - École des Ponts ParisTech - CNRS - Centre National de la Recherche Scientifique - INRAE - Institut National de Recherche pour l’Agriculture, l’Alimentation et l’Environnement)
    Abstract: Adoption of improved seed varieties has the potential to lead to substantial productivity increases in agriculture. However, only 36 percent of the farmers that grow an improved maize variety report doing so in Ethiopia. This paper provides the first causal evidence of the impact of misperception in improved maize varieties on farmers' production decisions, productivity and profitability. We employ an Instrumental Variable approach that takes advantage of the roll-out of a governmental program that increases transparency in the seed sector. We find that farmers who correctly classify the improved maize variety grown experience large increases in inputs usage (urea, NPS, labor) and yields, but no statistically significant changes in other agricultural practices or profits. Using machine learning techniques, we develop a model of interpolation to predict objectively measured varietal identification from farmers' self-reported data, which provides proof-of-concept towards scalable approaches to obtain reliable measures of crop varieties and allows us to extend the analysis to the nationally representative sample.
    Date: 2022–03
  15. By: Abbasi, Ali (Department of Surgery, University of California San Francisco); Gazze, Ludovica (Department of Economics, University of Warwick); Pals, Bridget (School of Law, New York University)
    Abstract: Lead exposure remains a significant threat to children’s health despite decades of policies aimed at getting the lead out of homes and neighborhoods. Generally, lead hazards are identified through inspections triggered by high blood lead levels (BLLs) in children. Yet, it is unclear how best to screen children for lead exposure to balance the costs of screening and the potential benefits of early detection, treatment, and lead hazard removal. While some states require universal screening, others employ a targeted approach, but no regime achieves 100% compliance. We estimate the extent and geographic distribution of undetected lead poisoning in Illinois. We then compare the estimated detection rate of a universal screening program to the current targeted screening policy under different compliance levels. To do so, we link 2010-2016 Illinois lead test records to 2010-2014 birth records, demographics, and housing data. We train a random forest classifier that predicts the likelihood a child has a BLL above 5µg/dL. We estimate that 10,613 untested children had a BLL≥5µg/dL in addition to the 18,115 detected cases. Due to the unequal spatial distribution of lead hazards, 60% of these undetected cases should have been screened under the current policy, suggesting limited benefits from universal screening.
    Keywords: Lead Poisoning; Environmental Health; Screening
    Date: 2022
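    Illustration: the case counts reported in this abstract imply a detection share that can be worked out directly. The sketch below uses only the two counts and the 60% figure from the abstract. Because that percentage is rounded in the source, the derived count is approximate, and the variable names are hypothetical.

```python
detected = 18_115     # detected cases with BLL >= 5 µg/dL (from the abstract)
undetected = 10_613   # estimated untested children with BLL >= 5 µg/dL (from the abstract)

total = detected + undetected
detection_share = detected / total

# The abstract says 60% of undetected cases should have been screened
# under the current targeted policy; the percentage is rounded in the source.
covered_by_policy = round(0.60 * undetected)
outside_policy = undetected - covered_by_policy

print(f"estimated total cases: {total}")
print(f"share detected: {detection_share:.1%}")
print(f"undetected but covered by current policy: about {covered_by_policy}")
print(f"undetected and outside current policy: about {outside_policy}")
```

    On these figures roughly 63% of estimated cases were detected, and most of the remainder fall within the existing screening rules, which is the basis for the abstract's conclusion that compliance matters more than moving to universal screening.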
  16. By: KONISHI Yoko; SAITO Takashi; KANAI Hajime; IGEI Naoya; MIZUMURA Junichi; SHIGA Kyoko; SUEYASU Keita; HAMAGUCHI Ryosuke
    Abstract: The COVID-19 pandemic has drastically changed our daily lives in terms of eating, learning, working, and leisure time. Japan has experienced five waves of widespread infection and three emergency declarations but has coped with the crisis in the absence of mandatory lockdowns, behavioral restrictions, and mandatory mask-wearing as seen in other countries. Much of the response has been through behavioral changes in our daily lives. In this paper, we observe the initial disruption, the adaptation period, and the change to a new normal by using "Consumption Big-data." We use POS data from supermarkets, convenience stores, home centers, drugstores, and consumer electronics mass merchandisers, as well as data from household book-keeping applications, for the two-year period from January 2020 to December 2021. The POS data was used to observe item-level sales trends, while the household book-keeping application data was used to observe trends in service expenditures and the prevalence of cashless payments. This made it possible to comprehensively understand the changes of consumer behavior during the pandemic.
    Date: 2021–03

This nep-big issue is ©2022 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at For comments please write to the director of NEP, Marco Novarese at <>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.