nep-big New Economics Papers
on Big Data
Issue of 2021‒11‒29
twenty-one papers chosen by
Tom Coupé
University of Canterbury

  1. A Sentiment-Enhanced Corruption Perception Index By Zaijin Zhan; Sandile Hlatshwayo; Ms. Yingjie Fan; Yongquan Cao; Monica Petrescu
  2. Joint Models for Cause-of-Death Mortality in Multiple Populations By Nhan Huynh; Mike Ludkovski
  3. Tackling Large Outliers in Macroeconomic Data with Vector Artificial Neural Network Autoregression By Vito Polito; Yunyi Zhang
  4. Learning to Play the Box-Sizing Game: A Machine Learning Approach for Solving the E-commerce Packaging Problem By Kandula, Shanthan; Krishnamoorthy, Srikumar; Roy, Debjit
  5. Some Children Left Behind: Variation in the Effects of an Educational Intervention By Julie Buhl-Wiggers; Jason T. Kerwin; Juan S. Muñoz-Morales; Jeffrey A. Smith; Rebecca Thornton
  6. A Thousand Words Tell More Than Just Numbers: Financial Crises and Historical Headlines By Kim Ristolainen; Tomi Roukka; Henri Nyberg
  7. Artificial Intelligence, Surveillance, and Big Data By David Karpa; Torben Klarl; Michael Rochlitz
  8. Disagreement inside the FOMC: New Insights from Tone Analysis By Davide Romelli; Hamza Bennani
  9. Differing roles of lifelong learning: Hedging against unemployment risks from skill obsolescence or boosting upward career mobility? By Tobias Schultheiss; Uschi Backes-Gellner
  10. Advanced statistical learning on short term load process forecasting By Hu, Junjie; López Cabrera, Brenda; Melzer, Awdesch
  11. Labour-saving automation and occupational exposure: a text-similarity measure By Fabio Montobbio; Jacopo Staccioli; Maria Enrica Virgillito; Marco Vivarelli
  12. Can satellite data on air pollution predict industrial production? By Jean-Charles Bricongne; Baptiste Meunier; Thomas Pical
  13. What Drives Financial Sector Development in Africa? Insights from Machine Learning By Isaac K. Ofori; Christopher Quaidoo; Pamela E. Ofori
  14. AI-tocracy By Martin Beraja; Andrew Kao; David Y. Yang; Noam Yuchtman
  15. Low-Acuity Patients Delay High-Acuity Patients in EDs By Luo, Danqi; Bayati, Mohsen; Plambeck, Erica L.; Aratow, Michael
  16. Stock Price Prediction Using Time Series, Econometric, Machine Learning, and Deep Learning Models By Ananda Chatterjee; Hrisav Bhowmick; Jaydip Sen
  17. Location inference on social media data for agile monitoring of public health crises: An application to opioid use and abuse during the Covid-19 pandemic By Angela E. Kilby; Charlie Denhart
  18. FinEAS: Financial Embedding Analysis of Sentiment By Asier Gutiérrez-Fandiño; Miquel Noguer i Alonso; Petter Kolm; Jordi Armengol-Estapé
  19. Financial condition indices for emerging market economies: can Google help? By Fabrizio Ferriani; Andrea Gazzani
  20. Behavioral Targeting, Machine Learning and Regression Discontinuity Designs By Narayanan, Sridhar; Kalyanam, Kirthi
  21. Self-organised criticality in high frequency finance: the case of flash crashes By Jeremy D. Turiel; Tomaso Aste

  1. By: Zaijin Zhan; Sandile Hlatshwayo; Ms. Yingjie Fan; Yongquan Cao; Monica Petrescu
    Abstract: Direct measurement of corruption is difficult due to its hidden nature, and measuring the perceptions of corruption via survey-based methods is often used as an alternative. This paper constructs a new non-survey based perceptions index for 111 countries by applying sentiment analysis to Financial Times articles over 2005–18. This sentiment-enhanced corruption perception index (SECPI) captures not only the frequency of corruption-related articles, but also the articles’ sentiment towards corruption. This index, while correlated with existing corruption perception indexes, offers some distinct advantages, including heightened sensitivity to current events (e.g., corruption investigations and elections), availability at a higher frequency, and lower costs to update. The SECPI is negatively correlated with business environment and institutional quality. Increases in the perceived incidence or scope of corruption influence economic agents’ behaviors, and thus economic dynamics. We found that when the SECPI is at least one standard deviation above the mean, the growth per capita falls by 0.65 percentage point on average, with more pronounced impacts for emerging market and low income countries.
    Keywords: sentiment analysis method; perception index; statistic department; articles' sentiment; summary statistics; Corruption; Emerging and frontier financial markets; Business environment; Global
    Date: 2021–07–23
  2. By: Nhan Huynh; Mike Ludkovski
    Abstract: We investigate jointly modeling age-specific rates of various causes of death in a multinational setting. We apply Multi-Output Gaussian Processes (MOGP), a spatial machine learning method, to smooth and extrapolate multiple cause-of-death mortality rates across several countries and both genders. To maintain flexibility and scalability, we investigate MOGPs with Kronecker-structured kernels and latent factors. In particular, we develop a custom multi-level MOGP that leverages the gridded structure of mortality tables to efficiently capture heterogeneity and dependence across different factor inputs. Results are illustrated with datasets from the Human Cause-of-Death Database (HCD). We discuss a case study involving cancer variations in three European nations, and a US-based study that considers eight top-level causes and includes comparison to all-cause analysis. Our models provide insights into the commonality of cause-specific mortality trends and demonstrate the opportunities for respective data fusion.
    Date: 2021–11
  3. By: Vito Polito; Yunyi Zhang
    Abstract: We develop a regime switching vector autoregression where artificial neural networks drive time variation in the coefficients of the conditional mean of the endogenous variables and the variance covariance matrix of the disturbances. The model is equipped with a stability constraint to ensure non-explosive dynamics. As such, it is employable to account for nonlinearity in macroeconomic dynamics not only during typical business cycles but also in a wide range of extreme events, like deep recessions and strong expansions. The methodology is put to the test using aggregate data for the United States that include the abnormal realizations during the recent Covid-19 pandemic. The model delivers plausible and stable structural inference, and accurate out-of-sample forecasts. This performance compares favourably against a number of alternative methodologies recently proposed to deal with large outliers in macroeconomic data caused by the pandemic.
    Keywords: nonlinear time series, regime switching models, extreme events, Covid-19, macroeconomic forecasting
    JEL: C45 C50 E37
    Date: 2021
  4. By: Kandula, Shanthan; Krishnamoorthy, Srikumar; Roy, Debjit
    Abstract: E-commerce packages are notorious for their inefficient usage of space. More than one-quarter of the volume of a typical e-commerce package comprises air and filler material. The inefficient usage of space significantly reduces the transportation and distribution capacity, increasing the operational costs. Therefore, designing an optimal set of packaging box sizes is imperative for improving efficiency. Though prior approaches for determining the optimal box sizes exist, they cannot be applied due to the wide range of SKUs hosted by the e-commerce warehouses. Besides, designing a few tens of boxes for covering hundreds of thousands of SKUs that span a wide range of sizes is impractical with the integer programming formulations used by the conventional approaches. This article proposes a scalable three-stage optimization framework that combines unsupervised learning, reinforcement learning, and tree search to design optimal box sizes. More specifically, the package optimization problem is formulated into a sequential decision-making task called the box-sizing game. A neural network agent is then designed to play the game and learn control policies to solve the problem. In addition, a tree-search operator is developed to improve the performance of the learned policies. The proposed framework is evaluated on real-world and synthetic datasets against standard metaheuristics and industry benchmarks. Results indicate the robustness and superiority of the approach in generating industry-strength solutions. Specifically, the packaging box assortments generated by the framework are 5% to 7.5% better than the industry baselines.
    Date: 2021–11–17
  5. By: Julie Buhl-Wiggers; Jason T. Kerwin; Juan S. Muñoz-Morales; Jeffrey A. Smith; Rebecca Thornton
    Abstract: We document substantial variation in the effects of a highly-effective literacy program in northern Uganda. The program increases test scores by 1.40 SDs on average, but standard statistical bounds show that the impact standard deviation exceeds 1.0 SD. This implies that the variation in effects across our students is wider than the spread of mean effects across all randomized evaluations of developing country education interventions in the literature. This very effective program does indeed leave some students behind. At the same time, we do not learn much from our analyses that attempt to determine which students benefit more or less from the program. We reject rank preservation, and the weaker assumption of stochastic increasingness leaves wide bounds on quantile-specific average treatment effects. Neither conventional nor machine-learning approaches to estimating systematic heterogeneity capture more than a small fraction of the variation in impacts given our available candidate moderators.
    JEL: C18 C21 I21 I25 J24
    Date: 2021–11
  6. By: Kim Ristolainen (Department of Economics, Turku School of Economics, University of Turku, Finland); Tomi Roukka (Department of Economics, Turku School of Economics, University of Turku, Finland); Henri Nyberg (Department of Mathematics and Statistics, University of Turku, Finland)
    Abstract: We show that financial crises are preceded by changes in specific types of narrative information contained in newspaper article titles. Our novel international dataset and the resulting empirical evidence are gathered by integrating information from a large panel of economic news articles in global newspapers between the years 1870 and 2016 with conventional macroeconomic and financial indicators. We find that the predictive information of newspaper article titles that signals coming crisis episodes is substantial over and above the macroeconomic and financial indicators. The new indicators capture common features that have often been discussed as potential causes of specific crises but which have not been incorporated into empirical models.
    Keywords: financial crisis, text data, leading indicators, topic model
    JEL: G00 G01 N01 C25 C82
    Date: 2021–11
  7. By: David Karpa; Torben Klarl; Michael Rochlitz
    Abstract: The most important resource to improve technologies in the field of artificial intelligence is data. Two types of policies are crucial in this respect: privacy and data-sharing regulations, and the use of surveillance technologies for policing. Both types of policies vary substantially across countries and political regimes. In this chapter, we examine how authoritarian and democratic political institutions can influence the quality of research in artificial intelligence, and the availability of large-scale datasets to improve and train deep learning algorithms. We focus mainly on the Chinese case, and find that -- ceteris paribus -- authoritarian political institutions continue to have a negative effect on innovation. They can, however, have a positive effect on research in deep learning, via the availability of large-scale datasets that have been obtained through government surveillance. We propose a research agenda to study which of the two effects might dominate in a race for leadership in artificial intelligence between countries with different political institutions, such as the United States and China.
    Date: 2021–11
  8. By: Davide Romelli (Department of Economics, Trinity College Dublin); Hamza Bennani (School of Economics and Management (IAE), University of Nantes)
    Abstract: This paper analyses the drivers of divergence in tone among Federal Open Market Committee (FOMC) members using text analysis tools. We use a financial dictionary to measure the tone of FOMC transcripts at the speaker-meeting-round level. We then relate the tone of FOMC members’ remarks with their individual projections for inflation and unemployment rate. Our results show a positive relationship between inflation projections and the tone used by FOMC members, suggesting that divergence in tone among members is mainly driven by differences in their projected levels of inflation. We also show that Federal Reserve Bank presidents and voting members are those who use a more distinct tone, in particular during the economics go-round.
    Keywords: central banks, monetary policy committees, federal reserve, fomc.
    JEL: E52 E58
    Date: 2021–09
  9. By: Tobias Schultheiss; Uschi Backes-Gellner
    Abstract: This paper examines the role of lifelong learning in counteracting skill depreciation and obsolescence. We build on findings showing that different skill types have structurally different depreciation rates. We differentiate between hard and soft skills and measure the relative importance of these two skill types at the occupational level. As data source we draw on a large sample of job advertisements and a categorization of their skill requirements through a machine-learning algorithm. We analyze lifelong learning effects for "harder" occupations (with relatively more hard than soft skills) versus "softer" occupations. Our results reveal important patterns of skill depreciation and counteracting lifelong learning effects: In harder occupations, the role of lifelong learning is primarily as a hedge against unemployment risks caused by fast-depreciating hard skills; in softer occupations, this role instead lies mostly in acting as a boost to wage gains and upward career mobility as workers build on a value-stable skill foundation.
    JEL: I26 J24
    Date: 2021–11
  10. By: Hu, Junjie; López Cabrera, Brenda; Melzer, Awdesch
    Abstract: Short Term Load Forecast (STLF) is necessary for effective scheduling, operation optimization trading, and decision-making for electricity consumers. Modern and efficient machine learning methods are now called upon to manage complicated structural big datasets, which are characterized by having a nonlinear temporal dependence structure. We propose different statistical nonlinear models to manage these challenges of hard type datasets and forecast 15-min frequency electricity load up to 2-days ahead. We show that the Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU) models applied to the production line of a chemical production facility outperform several other predictive models in terms of out-of-sample forecasting accuracy by the Diebold-Mariano (DM) test with several metrics. The predictive information is fundamental for the risk and production management of electricity consumers.
    Keywords: Short Term Load Forecast, Deep Neural Network, Hard Structure Load Process
    JEL: C51 C52 C53 Q31 Q41
    Date: 2021
  11. By: Fabio Montobbio; Jacopo Staccioli; Maria Enrica Virgillito; Marco Vivarelli
    Abstract: This paper represents one of the first attempts at building a direct measure of occupational exposure to robotic labour-saving technologies. After identifying robotic and labour-saving robotic patents retrieved by Montobbio et al. (2022), the underlying 4-digit CPC definitions are employed in order to detect functions and operations performed by technological artefacts which are more directed to substitute the labour input. This measure makes it possible to obtain fine-grained information on tasks and occupations according to their similarity ranking. Occupational exposure by wage and employment dynamics in the United States is then studied, complemented by investigating industry and geographical penetration rates.
    Keywords: Labour-Saving Technology; Natural Language Processes; Labour Markets; Technological Unemployment.
    Date: 2021–11–23
  12. By: Jean-Charles Bricongne; Baptiste Meunier; Thomas Pical
    Abstract: The Covid-19 crisis has highlighted innovative high-frequency datasets that allow the economic impact to be measured in real time. In this vein, we explore how satellite data measuring the concentration of nitrogen dioxide (NO2, a pollutant emitted mainly by industrial activity) in the troposphere can help predict industrial production. We first show how such data must be adjusted for meteorological patterns which can alter data quality and pollutant emissions. We use machine learning techniques to better account for non-linearities and interactions between variables. We then find evidence that nowcasting performances for monthly industrial production are significantly improved when relying on daily NO2 data compared to benchmark models based on PMIs and auto-regressive (AR) terms. We also find evidence of heterogeneities suggesting that the contribution of daily pollution data is particularly important during “crisis” episodes and that the elasticity of NO2 pollution to industrial production for a country depends on the share of manufacturing in the value added. Available daily, free-to-use, granular and covering all countries including those with limited statistics, this paper illustrates the potential of satellite-based data for air pollution in enhancing the real-time monitoring of economic activity.
    Keywords: Data Science, Big Data, Satellite Data, Nowcasting, Machine Learning, Industrial Production
    JEL: C51 C81 E23 E37
    Date: 2021
  13. By: Isaac K. Ofori (University of Insubria, Varese, Italy); Christopher Quaidoo (University of Insubria, Varese, Italy); Pamela E. Ofori (University of Insubria, Varese, Italy)
    Abstract: This study uses machine learning techniques to identify the key drivers of financial development in Africa. To this end, four regularization techniques (the standard lasso, the adaptive lasso, the minimum Schwarz Bayesian information criterion lasso, and the Elasticnet) are trained on a dataset containing 86 covariates of financial development for the period 1990–2019. The results show that variables such as cell phones, economic globalisation, institutional effectiveness, and literacy are crucial for financial sector development in Africa. Evidence from the Partialing-out lasso instrumental variable regression reveals that while inflation and agricultural sector employment suppress financial sector development, cell phones and institutional effectiveness are remarkable in spurring financial sector development in Africa. Policy recommendations are provided in line with the rise in globalisation, and technological progress in Africa.
    Keywords: Africa, Elasticnet, Financial Development, Financial Inclusion, Lasso, Regularization, Variable Selection
    JEL: C01 C14 C52 C53 C55 E5 O55
    Date: 2021–01
  14. By: Martin Beraja; Andrew Kao; David Y. Yang; Noam Yuchtman
    Abstract: Can frontier innovation be sustained under autocracy? We argue that innovation and autocracy can be mutually reinforcing when: (i) the new technology bolsters the autocrat’s power; and (ii) the autocrat’s demand for the technology stimulates further innovation in applications beyond those benefiting it directly. We test for such a mutually reinforcing relationship in the context of facial recognition AI in China. To do so, we gather comprehensive data on AI firms and government procurement contracts, as well as on social unrest across China during the last decade. We first show that autocrats benefit from AI: local unrest leads to greater government procurement of facial recognition AI, and increased AI procurement suppresses subsequent unrest. We then show that AI innovation benefits from autocrats’ suppression of unrest: the contracted AI firms innovate more both for the government and commercial markets. Taken together, these results suggest the possibility of sustained AI innovation under the Chinese regime: AI innovation entrenches the regime, and the regime’s investment in AI for political control stimulates further frontier innovation.
    JEL: E00 L5 L63 O25 O30 O40 P00
    Date: 2021–11
  15. By: Luo, Danqi (Stanford U); Bayati, Mohsen (Stanford U); Plambeck, Erica L. (Stanford U); Aratow, Michael (San Mateo Medical Center)
    Abstract: This paper provides evidence that the arrival of an additional low-acuity patient substantially increases the wait time to start of treatment for high-acuity patients, contradicting the long-standing prior conclusion in the medical literature that the effect is "negligible." Whereas the medical literature underestimates the effect by neglecting how delay propagates in a queuing system, this paper develops and validates a new estimation method based on queuing theory, machine learning and causal inference. Wait time information displayed to low-acuity patients provides a quasi-randomized instrumental variable. This paper shows that a low-acuity patient increases wait times for high-acuity patients through: pre-triage delay; delay of lab tests ordered for high-acuity patients; and transition delay when an ED interrupts treatment of a low-acuity patient in order to treat a high-acuity patient. Hence high-acuity patients' wait times could be reduced by: reducing the standard deviation or mean of those transition delays, particularly in bed-changeover; providing vertical or "fast track" treatment for more low-acuity patients, especially ESI 3 patients; standardizing providers' test-ordering for low-acuity patients; and designing wait time information systems to divert (especially when the ED is highly congested) low-acuity patients that do not need ED treatment.
    Date: 2021
  16. By: Ananda Chatterjee; Hrisav Bhowmick; Jaydip Sen
    Abstract: For a long time, researchers have been developing a reliable and accurate predictive model for stock price prediction. According to the literature, if predictive models are correctly designed and refined, they can painstakingly and faithfully estimate future stock values. This paper demonstrates a set of time series, econometric, and various learning-based models for stock price prediction. The data of Infosys, ICICI, and SUN PHARMA from the period of January 2004 to December 2019 was used here for training and testing the models to know which model performs best in which sector. One time series model (Holt-Winters Exponential Smoothing), one econometric model (ARIMA), two machine learning models (Random Forest and MARS), and two deep learning-based models (simple RNN and LSTM) have been included in this paper. MARS has been proved to be the best performing machine learning model, while LSTM has proved to be the best performing deep learning model. But overall, for all three sectors - IT (on Infosys data), Banking (on ICICI data), and Health (on SUN PHARMA data), MARS has proved to be the best performing model in sales forecasting.
    Date: 2021–11
  17. By: Angela E. Kilby; Charlie Denhart
    Abstract: The Covid-19 pandemic has intersected with the opioid epidemic to create a unique public health crisis, with the health and economic consequences of the virus and associated lockdowns compounding pre-existing social and economic stressors associated with rising opioid and heroin use and abuse. In order to better understand these interlocking crises, we use social media data to extract qualitative and quantitative insights on the experiences of opioid users during the Covid-19 pandemic. In particular, we use an unsupervised learning approach to create a rich geolocated data source for public health surveillance and analysis. To do this we first infer the location of 26,000 Reddit users that participate in opiate-related sub-communities (subreddits) by combining named entity recognition, geocoding, density-based clustering, and heuristic methods. Our strategy achieves 63 percent accuracy at state-level location inference on a manually-annotated reference dataset. We then leverage the geospatial nature of our user cohort to answer policy-relevant questions about the impact of varying state-level policy approaches that balance economic versus health concerns during Covid-19. We find that state government strategies that prioritized economic reopening over curtailing the spread of the virus created a markedly different environment and outcomes for opioid users. Our results demonstrate that geospatial social media data can be used for agile monitoring of complex public health crises.
    Date: 2021–11
  18. By: Asier Gutiérrez-Fandiño; Miquel Noguer i Alonso; Petter Kolm; Jordi Armengol-Estapé
    Abstract: We introduce a new language representation model in finance called Financial Embedding Analysis of Sentiment (FinEAS). In financial markets, news and investor sentiment are significant drivers of security prices. Thus, leveraging the capabilities of modern NLP approaches for financial sentiment analysis is a crucial component in identifying patterns and trends that are useful for market participants and regulators. In recent years, methods that use transfer learning from large Transformer-based language models like BERT have achieved state-of-the-art results in text classification tasks, including sentiment analysis using labelled datasets. Researchers have quickly adopted these approaches to financial texts, but best practices in this domain are not well-established. In this work, we propose a new model for financial sentiment analysis based on supervised fine-tuned sentence embeddings from a standard BERT model. We demonstrate our approach achieves significant improvements in comparison to vanilla BERT, LSTM, and FinBERT, a financial-domain-specific BERT.
    Date: 2021–10
  19. By: Fabrizio Ferriani (Bank of Italy); Andrea Gazzani (Bank of Italy)
    Abstract: We compare different approaches to constructing financial condition indices (FCIs) for major emerging market economies (EMEs). We further test whether measures of web-search intensity for keywords related to financial tensions can complement the information content of traditional financial variables. We find that an index constructed as a simple average of key financial variables augmented with data from Google searches outperforms several alternative definitions of FCIs in explaining business cycle fluctuations and capital flows episodes. These results hold true when controlling for proxies of the global financial cycle, highlighting that local financial market conditions are important for the macroeconomic performance of EMEs.
    Keywords: financial condition index, emerging markets, Google search, principal component analysis, VAR, quantile regressions
    JEL: C51 E44 F30 G01 G15
    Date: 2021–11
  20. By: Narayanan, Sridhar (Stanford U); Kalyanam, Kirthi (Santa Clara U)
    Abstract: The availability of behavioral and other data on customers and advances in machine learning methods have enabled targeting of customers in a variety of domains, including pricing, advertising, recommendation systems and personal selling contexts. Typically, such targeting involves first training a machine learning algorithm on a training dataset, and then using that algorithm to score current or potential customers. When the score crosses a threshold, a treatment (such as an offer, an advertisement or a recommendation) is assigned. In this paper, we demonstrate that this has given rise to opportunities for causal measurement of the effects of such targeted treatments using regression discontinuity designs (RDD). Investigating machine learning in a regression discontinuity framework leads to several insights. First, we characterize conditions under which regression discontinuity designs can be used to measure not just local average treatment effects (LATE), but also average treatment effects (ATE). In some situations, we show that RD can be used to find bounds on the ATE even if we are unable to find point estimates. We then apply this to the machine learning based targeting contexts by studying two different ways in which the score required for targeting is generated, and explore the utility of RDD to these contexts. Finally, we apply our approach in the empirical context of the targeting of retargeted display advertising. Using a dataset from a context where a machine learning based targeting policy was employed in parallel with a randomized controlled trial, we examine the performance of the RDD estimate in estimating the treatment effect, validate it using a placebo test and demonstrate its practical utility.
    Date: 2020–12
  21. By: Jeremy D. Turiel; Tomaso Aste
    Abstract: With the rise of computing and artificial intelligence, advanced modeling and forecasting has been applied to High Frequency markets. A crucial element of solid production modeling though relies on the investigation of data distributions and how they relate to modeling assumptions. In this work we investigate volume distributions during anomalous price events and show how their tail exponents
    Date: 2021–10
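
Illustrative note on item 20: the abstract describes treatments assigned when a machine-learning score crosses a threshold, which is the setting for a sharp regression discontinuity design. The sketch below is not the authors' method or data; it is a minimal local linear estimate on simulated data, and the function name `sharp_rdd_estimate`, the cutoff, the bandwidth, and the simulated effect size are all assumptions chosen for illustration.

```python
import numpy as np


def sharp_rdd_estimate(score, outcome, cutoff, bandwidth):
    """Local linear estimate of the outcome jump at the cutoff (sharp RDD).

    Hypothetical helper for illustration only: fits separate linear trends
    on each side of the cutoff, using observations within the bandwidth.
    """
    score = np.asarray(score, dtype=float)
    outcome = np.asarray(outcome, dtype=float)
    centered = score - cutoff
    in_window = np.abs(centered) <= bandwidth
    x = centered[in_window]
    y = outcome[in_window]
    treated = (x >= 0).astype(float)
    # Columns: intercept, treatment dummy, slope, slope change when treated.
    design = np.column_stack([np.ones_like(x), treated, x, treated * x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]  # coefficient on the treatment dummy = jump at cutoff


# Simulated data (assumption): treatment is assigned when score >= 0.5.
rng = np.random.default_rng(0)
n = 20000
score = rng.uniform(0.0, 1.0, n)
true_effect = 2.0
treated = score >= 0.5
outcome = 1.0 + 3.0 * score + true_effect * treated + rng.normal(0.0, 1.0, n)

estimate = sharp_rdd_estimate(score, outcome, cutoff=0.5, bandwidth=0.1)
print(f"estimated jump at cutoff: {estimate:.3f}")
```

The estimate is local to the cutoff; a narrower bandwidth reduces bias from curvature in the score-outcome relationship at the cost of higher variance.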

This nep-big issue is ©2021 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at For comments please write to the director of NEP, Marco Novarese at <>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.