nep-big New Economics Papers
on Big Data
Issue of 2019‒06‒24
thirteen papers chosen by
Tom Coupé
University of Canterbury

  1. Battling antibiotic resistance: can machine learning improve prescribing? By Michael Allan Ribers; Hannes Ullrich
  2. Bright Investments: Measuring the Impact of Transport Infrastructure Using Luminosity Data in Haiti By Mitnik, Oscar A.; Sanchez, Raul; Yañez, Patricia
  3. FinTechs and the Market for Financial Analysis By Jillian Grennan; Roni Michaely
  4. Machine Learning on EPEX Order Books: Insights and Forecasts By Simon Schnürch; Andreas Wagner
  5. Online Block Layer Decomposition schemes for training Deep Neural Networks By Laura Palagi; Ruggiero Seccia
  6. The Keys of Predictability: A Comprehensive Study By Giovanni Barone-Adesi; Antonietta Mira; Matteo Pisati
  7. A Machine Learning Analysis of Seasonal and Cyclical Sales in Weekly Scanner Data By Rishab Guha; Serena Ng
  8. Shape Matters: Evidence from Machine Learning on Body Shape-Income Relationship By Suyong Song; Stephen S. Baek
  9. Can Satellite Data Forecast Valuable Information from USDA Reports ? Evidences on Corn Yield Estimates By Pierrick Piette
  10. Mining Family History Society Burials By Gill Newton
  11. Online Job Seekers in Canada: What Can We Learn from Bing Job Queries? By André Binette; Karyne B. Charbonneau; Nicholas Curtis; Gabriela Galassi; Scott Counts; Justin Cranshaw
  12. Value of Data: There's No Such Thing as a Free Lunch in the Digital Economy By Wendy C.Y. LI; NIREI Makoto; YAMANA Kazufumi
  13. Sentiment-Driven Stochastic Volatility Model: A High-Frequency Textual Tool for Economists By Jozef Barunik; Cathy Yi-Hsuan Chen; Jan Vecer

  1. By: Michael Allan Ribers; Hannes Ullrich
    Abstract: Antibiotic resistance constitutes a major health threat. Predicting bacterial causes of infections is key to reducing antibiotic misuse, a leading driver of antibiotic resistance. We train a machine learning algorithm on administrative and microbiological laboratory data from Denmark to predict diagnostic test outcomes for urinary tract infections. Based on predictions, we develop policies to improve prescribing in primary care, highlighting the relevance of physician expertise and policy implementation when patient distributions vary over time. The proposed policies delay antibiotic prescriptions for some patients until test results are known and give them instantly to others. We find that machine learning can reduce antibiotic use by 7.42 percent without reducing the number of treated bacterial infections. As Denmark is one of the most conservative countries in terms of antibiotic use, this result is likely to be a lower bound of what can be achieved elsewhere.
    Keywords: antibiotic prescribing, prediction policy, machine learning, expert decision-making
    JEL: C10 I11 I18 L38 O38 Q28
    Date: 2019
  2. By: Mitnik, Oscar A.; Sanchez, Raul; Yañez, Patricia
    Abstract: This paper quantifies the impacts of transport infrastructure investments on economic activity in Haiti, proxied by satellite luminosity data. Our identification strategy exploits the differential timing of rehabilitation projects across various road segments of the primary road network. We combine multiple sources of non-traditional data and carefully address concerns related to unobserved heterogeneity. The results obtained across multiple specifications consistently indicate that receiving a road rehabilitation project leads to an increase in luminosity values of between 7 percent and 15 percent at the communal section level. Taking into account the national-level elasticity between luminosity values and GDP, we estimate that these interventions translate into GDP increases of around 0.6 percent to 1.2 percent in communal sections that benefited from a transport project. Findings also uncover some temporal and spatial variation, showing that effects take some time to appear and that it is not the richest or the poorest communities that gain from these investments but those in the middle of the income distribution.
    JEL: O10 R40 O47 D04
    Date: 2019–06
  3. By: Jillian Grennan (Duke University - Fuqua School of Business; Duke Innovation & Entrepreneurship Initiative); Roni Michaely (University of Geneva - Geneva Finance Research Institute (GFRI); Swiss Finance Institute)
    Abstract: Market intelligence FinTechs aggregate many data sources, including nontraditional ones, and synthesize such data using artificial intelligence to make investment recommendations. Using data from a market intelligence FinTech, we evaluate the relationship between the FinTech data coverage and market efficiency. We find an increase in price informativeness for stocks with higher FinTech coverage and that traditional sources of information have less impact on prices for those stocks. Consistent with FinTechs changing investors' behavior, we show a substitution between traditional information sources and FinTechs using internet click data. Overall, our results suggest the rise in FinTechs for investment recommendations benefits investors.
    Keywords: Fintech, FinTechs (financial technology firms), Market intelligence, Artificial intelligence, Aggregators, Social media, Financial blogs, Information and market efficiency, Big data, Machine learning, Datamining, Data signal providers
    JEL: D14 G11 G14 G23
    Date: 2019–03
  4. By: Simon Schnürch; Andreas Wagner
    Abstract: This paper employs machine learning algorithms to forecast German electricity spot market prices. The forecasts utilize in particular bid and ask order book data from the spot market but also fundamental market data like renewable infeed and expected demand. Appropriate feature extraction for the order book data is developed. Using cross-validation to optimise hyperparameters, neural networks and random forests are proposed and compared to statistical reference models. The machine learning models outperform traditional approaches.
    Date: 2019–06
  5. By: Laura Palagi (Department of Computer, Control and Management Engineering Antonio Ruberti (DIAG), University of Rome La Sapienza, Rome, Italy); Ruggiero Seccia (Department of Computer, Control and Management Engineering Antonio Ruberti (DIAG), University of Rome La Sapienza, Rome, Italy)
    Abstract: Estimating the weights of Deep Feedforward Neural Networks (DFNNs) relies on the solution of a very large nonconvex optimization problem that may have many local (non-global) minimizers, saddle points and large plateaus. Furthermore, the time needed to find good solutions to the training problem depends heavily on both the number of samples and the number of weights (variables). In this work, we show how Block Coordinate Descent (BCD) methods can be applied to improve the performance of state-of-the-art algorithms by avoiding bad stationary points and flat regions. We first describe a batch BCD method able to effectively tackle difficulties due to the network's depth; we then extend the algorithm by proposing an online BCD scheme able to scale with respect to both the number of variables and the number of samples. We report extensive numerical results on standard datasets using different deep networks, and we show that applying (online) BCD methods to the training phase of DFNNs makes it possible to outperform standard batch/online algorithms, improving both the training phase and the generalization performance of the networks.
    Keywords: Deep Feedforward Neural Networks ; Block coordinate decomposition ; Online Optimization ; Large scale optimization
    Date: 2019
  6. By: Giovanni Barone-Adesi (University of Lugano; Swiss Finance Institute); Antonietta Mira (Università della Svizzera italiana - InterDisciplinary Institute of Data Science); Matteo Pisati (Universita' della Svizzera Italiana)
    Abstract: The problem of market predictability can be decomposed into two parts: predictive models and predictors. First, we show how the joint employment of model selection and machine learning models can dramatically increase our capability to forecast the equity premium out-of-sample. Second, we introduce batteries of powerful predictors which bring the monthly S&P500 R-square to a high level of 24%. Finally, we prove that predictability is a generalized characteristic of U.S. equity markets. For each of the three parts, we consider the potential of, and the challenges posed by, the new approaches in the asset pricing field.
    Keywords: Markets Predictability, Machine Learning, Model Selection
    Date: 2019–03
  7. By: Rishab Guha; Serena Ng
    Abstract: This paper analyzes weekly scanner data collected for 108 groups at the county level between 2006 and 2014. The data display multi-dimensional weekly seasonal effects that are not exactly periodic but are cross-sectionally dependent. Existing univariate procedures are imperfect and yield adjusted series that continue to display strong seasonality upon aggregation. We suggest augmenting the univariate adjustments with a panel data step that pools information across counties. Machine learning tools are then used to remove the within-year seasonal variations. A demand analysis of the adjusted budget shares finds three factors: one that is trending, and two cyclical ones that are well aligned with the level and change in consumer confidence. The effects of the Great Recession vary across locations and product groups, with consumers substituting towards home cooking and away from non-essential goods. The adjusted data also reveal changes in spending in response to unanticipated shocks at the local level. The data are thus informative about both local and aggregate economic conditions once the seasonal effects are removed. The two-step methodology can be adapted to remove other types of nuisance variations provided that these variations are cross-sectionally dependent.
    JEL: E21 E32
    Date: 2019–05
  8. By: Suyong Song; Stephen S. Baek
    Abstract: We study the association between physical appearance and family income using a novel dataset of 3-dimensional body scans, which mitigates the reporting and measurement errors observed in most previous studies. We apply machine learning to obtain intrinsic features of the human body and take into account the possible endogeneity of body shape. The estimation results show that there is a significant relationship between physical appearance and family income and that the associations differ across genders. This supports the hypothesis of a physical attractiveness premium and its heterogeneity across genders.
    Date: 2019–06
  9. By: Pierrick Piette (SAF - Laboratoire de Sciences Actuarielle et Financière - UCBL - Université Claude Bernard Lyon 1 - Université de Lyon, LPSM UMR 8001 - Laboratoire de Probabilités, Statistique et Modélisation - UPD7 - Université Paris Diderot - Paris 7 - SU - Sorbonne Université - CNRS - Centre National de la Recherche Scientifique)
    Abstract: On the one hand, recent advances in satellite imagery and remote sensing allow one to easily follow crop conditions all around the world in near-real time. On the other hand, it has been shown that governmental agricultural reports contain useful news for the commodities market, whose participants react to this valuable information. In this paper, we investigate whether one can forecast some of the newsworthy information contained in the USDA reports through satellite data. We focus on the corn futures market over the period 2000-2016. We first check the well-documented presence of market reactions to the release of the monthly WASDE reports through statistical tests. Then we investigate the informational value of early yield estimates published in these governmental reports. Finally, we propose an econometric model based on MODIS NDVI time series to forecast this valuable information. Results show that the market reacts rationally to the NASS early yield forecasts. Moreover, the modeled NDVI-based information is significantly correlated with the market reactions. To conclude, we propose some improvements to be considered for a practical implementation.
    Keywords: NDVI, USDA reports, MODIS, Market information, Corn, Commodities market
    Date: 2019–06–06
  10. By: Gill Newton (University of Cambridge)
    Abstract: Part I of this paper describes a new 'Big Data' resource for historical mortality, the Family History Society burials dataset. This comprises 8.9 million individual records harmonised from Family History Society transcriptions of burial records in 4,200 English places with varying coverage dates spanning from about 1500 to 2000, and concentrated in the period 1600 to 1850. Adult and child burials have been separately identified using family relationship information, and post-1812 more precise age information is stated. Part II presents an exploratory analysis of burial seasonality and age at death using the Family History Society burials dataset. The seasonality of birth and baptism, which impacts on infant burial seasonality, is also considered using a subsample of four English counties (Suffolk, Cambridgeshire, Nottinghamshire and Lancashire). This research forms part of a Wellcome Trust funded research project led by Richard Smith at CAMPOP entitled ‘Migration, Mortality and Medicalisation: investigating the long-run epidemiological consequences of urbanisation 1600-1945’.
    Keywords: seasonality, mortality, burials, baptisms, big data
    JEL: N33
    Date: 2019–06–07
  11. By: André Binette; Karyne B. Charbonneau; Nicholas Curtis; Gabriela Galassi; Scott Counts; Justin Cranshaw
    Abstract: Labour markets in Canada and around the world are evolving rapidly with the digital economy. Traditional data are adapting gradually but are not yet able to provide timely information on this evolution.
    Keywords: Central bank research; Labour markets; Monetary Policy
    JEL: C80 E24 J21
    Date: 2019–06
  12. By: Wendy C.Y. LI; NIREI Makoto; YAMANA Kazufumi
    Abstract: The Facebook-Cambridge Analytica data scandal demonstrates that there is no such thing as a free lunch in the digital world. Online platform companies exchange "free" digital goods and services for consumer data, reaping potentially significant economic benefits by monetizing data. The proliferation of "free" digital goods and services poses challenges not only to policymakers who generally rely on prices to indicate a good's value but also to corporate managers and investors who need to know how to value data, a key input of digital goods and services. In this research, we first examine the data activities for seven major types of online platforms based on the underlying business models. We show how online platform companies take steps to create the value of data, and present the data value chain to show the value-added activities involved in each step. We find that online platform companies can vary in the degree of vertical integration in the data value chain, and the variation can determine how they monetize their data and how much economic benefit they can capture. Unlike R&D that may depreciate due to obsolescence, data can produce new values through data fusion, a unique feature that creates unprecedented challenges in measurement. Our initial estimates indicate that data can have enormous value. Online platform companies can capture the most benefit from the data because they create the value of the data and because consumers lack knowledge regarding the value of their own data. As trends such as 5G and the Internet of Things are accelerating the accumulation of data types and volume, the valuation of data will have important policy implications for investment, trade, and growth.
    Date: 2019–03
  13. By: Jozef Barunik; Cathy Yi-Hsuan Chen; Jan Vecer
    Abstract: We propose a way to quantify high-frequency market sentiment using high-frequency news from the NASDAQ news platform and support vector machine classifiers. News arrives at markets randomly and the resulting news sentiment behaves like a stochastic process. To characterize the joint evolution of sentiment, price, and volatility, we introduce a unified continuous-time sentiment-driven stochastic volatility model. We provide closed-form formulas for moments of the volatility and news sentiment processes and study the news impact. Further, we implement a simulation-based method to calibrate the parameters. Empirically, we document that news sentiment raises the threshold of volatility reversion, sustaining high market volatility.
    Date: 2019–05

This nep-big issue is ©2019 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at For comments please write to the director of NEP, Marco Novarese at <>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.