nep-big New Economics Papers
on Big Data
Issue of 2021‒03‒08
nineteen papers chosen by
Tom Coupé
University of Canterbury

  1. Who is the Most Sought-After Economist? Ranking Economists Using Google Trends By Tom Coupé
  2. Gender Distribution across Topics in Top 5 Economics Journals: A Machine Learning Approach By J. Ignacio Conde-Ruiz; Juan-José Ganuza; Manu García; Luis A. Puch
  3. Can Machine Learning Catch the COVID-19 Recession? By Philippe Goulet Coulombe; Massimiliano Marcellino; Dalibor Stevanovic
  4. The Gender Pay Gap Revisited with Big Data: Do Methodological Choices Matter? By Strittmatter, Anthony; Wunsch, Conny
  5. Machine Learning and Oil Price Point and Density Forecasting By Alexandre Bonnet R. Costa; Pedro Cavalcanti G. Ferreira; Wagner P. Gaglianone; Osmani Teixeira C. Guillén; João Victor Issler; Yihao Lin
  6. Do gender wage differences within households influence women's empowerment and welfare?: Evidence from Ghana By Michael Danquah; Abdul Malik Iddrisu; Ernest Owusu Boakye; Solomon Owusu
  7. Using machine learning for measuring democracy: An update By Gründler, Klaus; Krieger, Tommy
  8. Investor Confidence and Forecastability of US Stock Market Realized Volatility : Evidence from Machine Learning By Rangan Gupta; Jacobus Nel; Christian Pierdzioch
  9. Machine Learning and Credit Risk: Empirical Evidence from SMEs By Alessandro Bitetto; Paola Cerchiello; Stefano Filomeni; Alessandra Tanda; Barbara Tarantino
  10. BIG (ET) DATA DANS TOUTE SA COMPLEXITE By Cécile Godé; Amandine Pascal
  11. Cross-Fitting and Averaging for Machine Learning Estimation of Heterogeneous Treatment Effects By Jacob, Daniel
  12. Doubly-Adaptive Thompson Sampling for Multi-Armed and Contextual Bandits By Maria Dimakopoulou; Zhimei Ren; Zhengyuan Zhou
  13. Transfer Learning for Business Cycle Identification By Marcelle Chauvet; Rafael R. S. Guimaraes
  14. Eye in the sky: private satellites and government macro data By Abhiroop Mukherjee; George Panayotov; Janghoon Shon
  15. Deep Learning application for fraud detection in financial statements By Craja, Patricia; Kim, Alisa; Lessmann, Stefan
  16. Risk & Returns around Fomc Press Conferences: A Novel Perspective from Computer Vision By Alexis Marchal
  17. Forecasting financial markets with semantic network analysis in the COVID-19 crisis By Andrea Fronzetti Colladon; Stefano Grassi; Francesco Ravazzolo; Francesco Violante
  18. Governance of Data Sharing : a Law & Economics Proposal By Graef, Inge; Prüfer, Jens
  19. Do Words Hurt More Than Actions? The Impact of Trade Tensions on Financial Markets By Massimo Ferrari; Frederik Kurcz; Maria Sole Pagliari

  1. By: Tom Coupé (University of Canterbury)
    Abstract: This paper uses Google Trends to rank economists and discusses the advantages and disadvantages of using Google Trends compared with other ranking methods, like those based on citations or downloads. I find that search intensity rankings based on Google Trends data are only modestly correlated with more traditional measures of scholarly impact; hence, search intensity statistics can provide additional information, allowing one to show a more comprehensive picture of academics’ impact. In addition, search intensity rankings can help to illustrate the variety in economists’ careers that can lead to fame and allow a comparison of the current impact of both contemporaneous and past economists. Complete rankings can be found at
    Keywords: Economists, rankings, Google Trends, performance measurement
    JEL: A11 B30
    Date: 2021–02–01
  2. By: J. Ignacio Conde-Ruiz; Juan-José Ganuza; Manu García; Luis A. Puch
    Abstract: We analyze all the articles published in Top 5 economics journals between 2002 and 2019 in order to find gender differences in their research approach. Using an unsupervised machine learning algorithm (Structural Topic Model) developed by Roberts et al. (2019), we jointly characterize the set of latent topics that best fits our data (the set of abstracts) and how the documents/abstracts are allocated to each latent topic. These latent topics are mixtures over words, where each word has a probability of belonging to a topic after controlling for year and journal. These latent topics may capture research fields but also other more subtle characteristics related to the way in which the articles are written. We find that females are unevenly distributed across these latent topics, using only data-driven methods. The differences in gender research approaches that we find in this paper are "automatically" generated from the research articles, without an arbitrary allocation to particular categories (such as JEL codes or research areas).
    Keywords: machine learning, structural topic model, gender, research fields
    JEL: I20 J16
    Date: 2021–03
  3. By: Philippe Goulet Coulombe; Massimiliano Marcellino; Dalibor Stevanovic
    Abstract: Based on evidence gathered from a newly built large macroeconomic data set for the UK, labeled UK-MD and comparable to similar datasets for the US and Canada, it seems the most promising avenue for forecasting during the pandemic is to allow for general forms of nonlinearity by using machine learning (ML) methods. But not all nonlinear ML methods are alike. For instance, some cannot extrapolate (like regular trees and forests) and some can (when complemented with linear dynamic components). This and other crucial aspects of ML-based forecasting in unprecedented times are studied in an extensive pseudo-out-of-sample exercise.
    Keywords: Machine Learning,Big Data,Forecasting,COVID-19
    JEL: C53 C55 E37
    Date: 2021–03–02
  4. By: Strittmatter, Anthony (University of St. Gallen); Wunsch, Conny (University of Basel)
    Abstract: The vast majority of existing studies that estimate the average unexplained gender pay gap use unnecessarily restrictive linear versions of the Blinder-Oaxaca decomposition. Using a notably rich and large data set of 1.7 million employees in Switzerland, we investigate how the methodological improvements made possible by such big data affect estimates of the unexplained gender pay gap. We study the sensitivity of the estimates with regard to i) the availability of observationally comparable men and women, ii) model flexibility when controlling for wage determinants, and iii) the choice of different parametric and semi-parametric estimators, including variants that make use of machine learning methods. We find that these three factors matter greatly. Blinder-Oaxaca estimates of the unexplained gender pay gap decline by up to 39% when we enforce comparability between men and women and use a more flexible specification of the wage equation. Semi-parametric matching yields estimates that, when compared with the Blinder-Oaxaca estimates, are up to 50% smaller and also less sensitive to the way wage determinants are included.
    Keywords: gender inequality, gender pay gap, common support, model specification, matching estimator, machine learning
    JEL: J31 C21
    Date: 2021–02
  5. By: Alexandre Bonnet R. Costa; Pedro Cavalcanti G. Ferreira; Wagner P. Gaglianone; Osmani Teixeira C. Guillén; João Victor Issler; Yihao Lin
    Abstract: The purpose of this paper is to explore machine learning techniques to forecast the oil price. In the era of big data, we investigate whether new automated tools can improve over traditional approaches in terms of forecast accuracy. Oil price point and density forecasts are built from 22 methods, including regression trees (random forest, quantile regression forest, xgboost), regularization procedures (elastic net, lasso, ridge), standard econometric models and forecast combinations, besides the structural factor model of Schwartz and Smith (2000). The database contains 315 macroeconomic and financial variables, used to build high-dimensional models. To evaluate the predictive power of each method, an extensive pseudo out-of-sample forecasting exercise is built, in monthly and quarterly frequencies, with horizons from one month up to five years. Overall, the results indicate a good performance of the machine learning methods in the short run. Up to six months, the lasso-based models, oil future prices, and the Schwartz-Smith model provide the best forecasts. At longer horizons, forecast combinations also become relevant. In several cases, the accuracy gains with respect to the random walk forecast are statistically significant and reach two-digit figures, in percentage terms, using the R2 out-of-sample statistic; a notable achievement compared to the previous literature.
    Date: 2021–02
  6. By: Michael Danquah; Abdul Malik Iddrisu; Ernest Owusu Boakye; Solomon Owusu
    Abstract: Using household data from the latest wave of the Ghana Living Standards Survey, this paper utilizes machine learning techniques to examine the effect of gender wage differences within households on women's empowerment and welfare in Ghana. The structural parameters of the post-double selection LASSO estimations show that a reduction in the household gender wage gap significantly enhances women's empowerment. Also, a decline in the household gender wage gap meaningfully improves household welfare.
    Keywords: Gender wage gap, Households, Women's empowerment, Welfare, Machine learning, Ghana
    Date: 2021
  7. By: Gründler, Klaus; Krieger, Tommy
    Abstract: We provide a comprehensive overview of the literature on the measurement of democracy and present an extensive update of the Machine Learning indicator of Gründler and Krieger (2016, European Journal of Political Economy). Four improvements are particularly notable: First, we produce a continuous and a dichotomous version of the Machine Learning democracy indicator. Second, we calculate intervals that reflect the degree of measurement uncertainty. Third, we refine the conceptualization of the Machine Learning Index. Finally, we largely expand the data coverage by providing democracy indicators for 186 countries in the period from 1919 to 2019.
    Keywords: Data aggregation,Democracy indicators,Machine Learning,Measurement Issues,Regime Classifications,Support Vector Machines
    JEL: C38 C43 C82 E02 P16
    Date: 2021
  8. By: Rangan Gupta (Department of Economics, University of Pretoria, Private Bag X20, Hatfield, 0028, South Africa); Jacobus Nel (Department of Economics, University of Pretoria, Private Bag X20, Hatfield, 0028, South Africa); Christian Pierdzioch (Department of Economics, Helmut Schmidt University, Holstenhofweg 85, P.O.B. 700822, 22008 Hamburg, Germany)
    Abstract: Using a machine-learning technique known as random forests, we analyze the role of investor confidence in forecasting monthly aggregate realized stock-market volatility of the United States (US), over and above a wide array of macroeconomic and financial variables. We estimate random forests on data for a period from 2001 to 2020, and study horizons up to one year by computing forecasts for a recursive and a rolling estimation window. We find that investor confidence, and especially investor confidence uncertainty, has out-of-sample predictive value for overall realized volatility, as well as its “good” and “bad” variants. Our results have important implications for investors and policymakers.
    Keywords: Investor Confidence, Realized Volatility, Macroeconomic and Financial Predictors, Forecasting, Machine Learning
    JEL: C22 C53 G10 G17
  9. By: Alessandro Bitetto (University of Pavia); Paola Cerchiello (University of Pavia); Stefano Filomeni (University of Essex); Alessandra Tanda (University of Pavia); Barbara Tarantino (University of Pavia)
    Abstract: In this paper we assess credit risk of SMEs by testing and comparing a classic parametric approach fitting an ordered probit model with a non-parametric one calibrating a machine learning historical random forest (HRF) model. We do so by exploiting a unique and proprietary dataset comprising granular firm-level quarterly data collected from a large European bank and an international insurance company on a sample of 810 Italian small- and medium-sized enterprises (SMEs) over the time period 2015-2017. Our results provide novel evidence that a dynamic Historical Random Forest (HRF) approach outperforms the traditional ordered probit model, highlighting how advanced estimation methodologies that use machine learning techniques can be successfully implemented to predict SME credit risk. Moreover, by using Shapley values for the first time, we are able to assess the relevance of each variable in predicting SME credit risk. Traditionally, credit risk evaluation of informationally-opaque SMEs has relied on soft information-intensive relationship banking. However, the advent of large banking conglomerates and the limits to successfully "harden" and transmit soft information across large banking organizations, challenge the traditional role of relationship banking, urging the need to evaluate SME credit risk by implementing alternative methodologies mostly based on hard information.
    Keywords: Credit Rating, SME, Historical Random Forest, Machine Learning, Relationship Banking, Soft Information
    JEL: C52 C53 D82 D83 G21 G22
    Date: 2021–02
  10. By: Cécile Godé (CRET-LOG - Centre de Recherche sur le Transport et la Logistique - AMU - Aix Marseille Université); Amandine Pascal (LEST - Laboratoire d'économie et de sociologie du travail - AMU - Aix Marseille Université - CNRS - Centre National de la Recherche Scientifique)
    Abstract: The best-known definitions of Big Data emphasize its scale or the technological capacities for storing and processing massive data. After revisiting these two dimensions, this column develops a third, little addressed in the management literature: complexity. Big Data then no longer necessarily refers to massive data sets but to combinations of heterogeneous data that interact and transform in unpredictable ways. This perspective encourages abandoning all simplifying thinking in order to fully grasp Big Data and embrace the new challenges it raises.
    Date: 2021–02–08
  11. By: Jacob, Daniel
    Abstract: We investigate the finite sample performance of sample splitting, cross-fitting and averaging for the estimation of the conditional average treatment effect. Recently proposed methods, so-called meta-learners, make use of machine learning to estimate different nuisance functions and hence allow for fewer restrictions on the underlying structure of the data. To limit a potential overfitting bias that may result when using machine learning methods, cross-fitting estimators have been proposed. This includes the splitting of the data in different folds to reduce bias and averaging over folds to restore efficiency. To the best of our knowledge, it is not yet clear how exactly the data should be split and averaged. We employ a Monte Carlo study with different data generation processes and consider twelve different estimators that vary in sample-splitting, cross-fitting and averaging procedures. We investigate the performance of each estimator independently on four different meta-learners: the doubly-robust-learner, R-learner, T-learner and X-learner. We find that the performance of all meta-learners heavily depends on the procedure of splitting and averaging. The best performance in terms of mean squared error (MSE) among the sample split estimators can be achieved when applying cross-fitting plus taking the median over multiple different sample-splitting iterations. Some meta-learners exhibit a high variance when the lasso is included in the ML methods. Excluding the lasso decreases the variance and leads to robust and at least competitive results.
    Keywords: causal inference,sample splitting,cross-fitting,sample averaging,machine learning,simulation study
    JEL: C01 C14 C31 C63
    Date: 2020
  12. By: Maria Dimakopoulou; Zhimei Ren; Zhengyuan Zhou
    Abstract: To balance exploration and exploitation, multi-armed bandit algorithms need to conduct inference on the true mean reward of each arm in every time step using the data collected so far. However, the history of arms and rewards observed up to that time step is adaptively collected and there are known challenges in conducting inference with non-iid data. In particular, sample averages, which play a prominent role in traditional upper confidence bound algorithms and traditional Thompson sampling algorithms, are neither unbiased nor asymptotically normal. We propose a variant of a Thompson sampling based algorithm that leverages recent advances in the causal inference literature and adaptively re-weighs the terms of a doubly robust estimator on the true mean reward of each arm -- hence its name doubly-adaptive Thompson sampling. The regret of the proposed algorithm matches the optimal (minimax) regret rate and its empirical evaluation in a semi-synthetic experiment based on data from a randomized control trial of a web service is performed: we see that the proposed doubly-adaptive Thompson sampling has superior empirical performance to existing baselines in terms of cumulative regret and statistical power in identifying the best arm. Further, we extend this approach to contextual bandits, where there are more sources of bias present apart from the adaptive data collection -- such as the mismatch between the true data generating process and the reward model assumptions or the unequal representations of certain regions of the context space in initial stages of learning -- and propose the linear contextual doubly-adaptive Thompson sampling and the non-parametric contextual doubly-adaptive Thompson sampling extensions of our approach.
    Date: 2021–02
  13. By: Marcelle Chauvet; Rafael R. S. Guimaraes
    Abstract: A transfer learning strategy is proposed to identify business cycle phases when data are limited or there is no business cycle dating committee. The approach integrates the idea of storing knowledge gained from one region’s economics experts and applying it to other geographic areas. The first is captured with a supervised deep neural network model, and the second by applying it to another dataset, a domain adaptation procedure. The results indicate that the proposed method leads to successful business cycle identification.
    Date: 2021–02
  14. By: Abhiroop Mukherjee (Department of Finance, The Hong Kong University of Science and Technology.); George Panayotov (Department of Finance, The Hong Kong University of Science and Technology.); Janghoon Shon (Department of Finance, The Hong Kong University of Science and Technology.)
    Abstract: We develop an approach to identify whether recent technological advancements – such as the rise of commercial satellite-based macroeconomic estimates – can provide an effective alternative to government data. We measure the extent to which satellite estimates are affecting the value of government macro news using the asset price impact of scheduled announcements. Our identification relies on cloud cover, which prevents satellites from observing economic activity at a few key hubs. Applying our approach, we find that some satellite estimates are now so effective that markets are no longer surprised by government announcements. Our results point to a future in which the resolution of macro uncertainty is smoother, and governments have less control over macro information.
    Keywords: Alternative data, Satellite Imagery, Asset price impact, Macroeconomic Estimates
    JEL: G14 E44
    Date: 2019–10
  15. By: Craja, Patricia; Kim, Alisa; Lessmann, Stefan
    Abstract: Financial statement fraud is an area of significant consternation for potential investors, auditing companies, and state regulators. Intelligent systems facilitate detecting financial statement fraud and assist the decision-making of relevant stakeholders. Previous research detected instances in which financial statements have been fraudulently misrepresented in managerial comments. The paper aims to investigate whether it is possible to develop an enhanced system for detecting financial fraud through the combination of information sourced from financial ratios and managerial comments within corporate annual reports. We employ a hierarchical attention network (HAN) with a long short-term memory (LSTM) encoder to extract text features from the Management Discussion and Analysis (MD&A) section of annual reports. The model is designed to offer two distinct features. First, it reflects the structured hierarchy of documents, which previous models were unable to capture. Second, the model embodies two different attention mechanisms at the word and sentence level, which allows content to be differentiated in terms of its importance in the process of constructing the document representation. As a result of its architecture, the model captures both content and context of managerial comments, which serve as supplementary predictors to financial ratios in the detection of fraudulent reporting. Additionally, the model provides interpretable indicators denoted as “red-flag” sentences, which assist stakeholders in their process of determining whether further investigation of a specific annual report is required. Empirical results demonstrate that textual features of MD&A sections extracted by HAN yield promising classification results and substantially reinforce financial ratios.
    Keywords: fraud detection,financial statements,deep learning,text analytics
    JEL: C00
    Date: 2020
  16. By: Alexis Marchal (EPFL; SFI)
    Abstract: I propose a new tool to characterize the resolution of uncertainty around FOMC press conferences. It relies on the construction of a measure capturing the level of discussion complexity between the Fed Chair and reporters during the Q&A sessions. I show that complex discussions are associated with higher equity returns and a drop in realized volatility. The method creates an attention score by quantifying how much the Chair needs to rely on reading internal documents to be able to answer a question. This is accomplished by building a novel dataset of video images of the press conferences and leveraging recent deep learning algorithms from computer vision. This alternative data provides new information on nonverbal communication that cannot be extracted from the widely analyzed FOMC transcripts. This paper can be seen as a proof of concept that certain videos contain valuable information for the study of financial markets.
    Keywords: FOMC, Machine learning, Computer vision, Alternative data, Asset pricing, Equity premium.
    JEL: C45 C55 C80 E58 G12 G14
    Date: 2021–03
  17. By: Andrea Fronzetti Colladon (University of Perugia); Stefano Grassi (University of Rome Tor Vergata); Francesco Ravazzolo (Free University of Bozen-Bolzano and CAMP, BI Norwegian Business School); Francesco Violante (CREST, GENES, ENSAE Paris, Institut Polytechnique de Paris and CREATES - Aarhus University)
    Abstract: This paper uses a new textual data index for predicting stock market data. The index is applied to a large set of news to evaluate the importance of one or more general economic-related keywords appearing in the text. The index assesses the importance of the economic-related keywords, based on their frequency of use and semantic network position. We apply it to the Italian press and construct indices to predict Italian stock and bond market returns and volatilities in a recent sample period, including the COVID-19 crisis. The evidence shows that the index captures the different phases of financial time series well. Moreover, results indicate strong evidence of predictability for bond market data, both returns and volatilities, short and long maturities, and stock market volatility.
    Date: 2021–03–03
  18. By: Graef, Inge (Tilburg University, Center For Economic Research); Prüfer, Jens (Tilburg University, Center For Economic Research)
    Keywords: Data sharing; data-driven markets; economic governance; competition law; data protection; regulation
    Date: 2021
  19. By: Massimo Ferrari; Frederik Kurcz; Maria Sole Pagliari
    Abstract: In this paper, we apply textual analysis and machine learning algorithms to construct an index capturing trade tensions between the US and China. Our indicator matches well-known events in the US-China trade dispute and is exogenous to the developments on global financial markets. By means of local projection methods, we show that US markets are largely unaffected by rising trade tensions, with the exception of those firms that are more exposed to China, while the same shock negatively affects stock market indices in EMEs and China. Higher trade tensions also entail: i) an appreciation of the US dollar; ii) a depreciation of EME currencies; iii) muted changes in safe haven currencies; iv) portfolio re-balancing between stocks and bonds in the EMEs. We also show that trade tensions account for around 15% of the variance of Chinese stocks while their contribution is muted for US markets. These findings suggest that the US-China trade tensions are interpreted as a negative demand shock for the Chinese economy rather than as a global risk shock.
    Keywords: Trade Shocks; Machine Learning; Stock Indexes; Exchange Rates.
    JEL: D53 E44 F13 F14 C55
    Date: 2021

This nep-big issue is ©2021 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at For comments please write to the director of NEP, Marco Novarese at <>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.