nep-big New Economics Papers
on Big Data
Issue of 2020‒08‒31
seventeen papers chosen by
Tom Coupé
University of Canterbury

  1. Crowd, Lending, Machine, and Bias By Runshan Fu; Yan Huang; Param Vir Singh
  2. The Effect of COVID-19 Lockdown on Mobility and Traffic Accidents: Evidence from Louisiana By Shafiullah Qureshi; Ba M Chu; Fanny S. Demers
  3. Management accounting and the idea of machine learning By Steen Nielsen
  4. An Application of High-Dimensional Statistics to Predictive Modeling of Grade Variability By Juri Hinz; Igor Grigoryev; Alexander Novikov
  5. Macroeconomic Data Transformations Matter By Philippe Goulet Coulombe; Maxime Leroux; Dalibor Stevanovic; Stéphane Surprenant
  6. Nowcasting with large Bayesian vector autoregressions By Cimadomo, Jacopo; Giannone, Domenico; Lenza, Michele; Sokol, Andrej; Monti, Francesca
  7. Supervised Machine Learning Techniques: An Overview with Applications to Banking By Linwei Hu; Jie Chen; Joel Vaughan; Hanyu Yang; Kelly Wang; Agus Sudjianto; Vijayan N. Nair
  8. Machine Learning Panel Data Regressions with an Application to Nowcasting Price Earnings Ratios By Andrii Babii; Ryan T. Ball; Eric Ghysels; Jonas Striaukas
  9. Misogynistic and Xenophobic Hate Language Online: A Matter of Anonymity By von Essen, Emma; Jansson, Joakim
  10. Generating Trading Signals by ML algorithms or time series ones? By Omid Safarzadeh
  11. Alternative Methods for Studying Consumer Payment Choice By Oz Shy
  12. Leveraging the Power of Place: A Data-Driven Decision Helper to Improve the Location Decisions of Economic Immigrants By Jeremy Ferwerda; Nicholas Adams-Cohen; Kirk Bansak; Jennifer Fei; Duncan Lawrence; Jeremy M. Weinstein; Jens Hainmueller
  13. Dynamic Factor Trees and Forests – A Theory-led Machine Learning Framework for Non-Linear and State-Dependent Short-Term U.S. GDP Growth Predictions By Daniel Wochner
  14. Developing a real estate yield investment device using granular data and machine learning By Monica Azqueta-Gavaldon; Gonzalo Azqueta-Gavaldon; Inigo Azqueta-Gavaldon; Andres Azqueta-Gavaldon
  15. Short-term forecasting of the COVID-19 pandemic using Google Trends data: Evidence from 158 countries By Fantazzini, Dean
  16. A tale of three countries: How did Covid-19 lockdown impact happiness? By Greyling, Talita; Rossouw, Stephanie; Adhikari, Tamanna
  17. Reinforcement Learning in Limit Order Markets By Xue-Zhong He; Shen Lin

  1. By: Runshan Fu; Yan Huang; Param Vir Singh
    Abstract: Big data and machine learning (ML) algorithms are key drivers of many fintech innovations. While it may be obvious that replacing humans with machines would increase efficiency, it is not clear whether and where machines can make better decisions than humans. We answer this question in the context of crowd lending, where decisions are traditionally made by a crowd of investors. Using data from, we show that a reasonably sophisticated ML algorithm predicts listing default probability more accurately than crowd investors. The dominance of the machine over the crowd is more pronounced for highly risky listings. We then use the machine to make investment decisions, and find that the machine benefits not only the lenders but also the borrowers. When machine prediction is used to select loans, it leads to a higher rate of return for investors and more funding opportunities for borrowers with few alternative funding options. We also find suggestive evidence that the machine is biased in gender and race even when it does not use gender and race information as input. We propose a general and effective "debiasing" method that can be applied to any prediction-focused ML application, and demonstrate its use in our context. We show that the debiased ML algorithm, which suffers from lower prediction accuracy, still leads to better investment decisions compared with the crowd. These results indicate that ML can help crowd lending platforms better fulfill the promise of providing access to financial resources to otherwise underserved individuals and ensure fairness in the allocation of these resources.
    Date: 2020–07
  2. By: Shafiullah Qureshi (Department of Economics, Carleton University); Ba M Chu (Department of Economics, Carleton University); Fanny S. Demers (Department of Economics, Carleton University)
    Abstract: The objective of this paper is to apply state-of-the-art machine-learning (ML) algorithms to predict the monthly and quarterly real GDP growth of Canada using both Google Trends (GT) and official data that are available ahead of the release of GDP data by Statistics Canada. This paper applies a novel approach for selecting features with XGBoost using the AutoML function of H2O. For this purpose, 5000 to 15000 XGBoost models are trained using this function. We use a very rigorous variable selection procedure, where only the best features are selected into the next stage to build a final learning model. Then pertinent features are introduced into the XGBoost model for forecasting real GDP growth rate. The forecasts are further improved by using Principal Component Analysis to choose the best factors out of the predictors selected by the XGBoost model. The results indicate that there are gains in nowcasting accuracy from using the XGBoost model with this two-step strategy. We first find that XGBoost is a superior forecasting model relative to our baseline models using alternative forecasting procedures such as AR(1). We also find that Google Trends data combined with the XGBoost model provide a very viable source of information for predicting Canadian real GDP growth when official data are not yet available due to publication lags. Thus, we can forecast real GDP growth rate accurately ahead of the release of official data. Moreover, we apply various techniques to make the machine learning model more interpretable.
    Date: 2020–08
  3. By: Steen Nielsen (Department of Economics and Business Economics, Aarhus University)
    Abstract: Not only is the role of data changing in a most dramatic way, but also the way we can handle and use the data through a number of new technologies such as Machine Learning (ML) and Artificial Intelligence (AI). The changes, their speed and scale, as well as their impact on almost every aspect of daily life and, of course, on Management Accounting are almost unbelievable. The term ‘data’ in this context means business data in the broadest possible sense. ML teaches computers to do what comes naturally to humans and decision makers: that is to learn from experience. ML and AI for management accountants have only been sporadically discussed within the last 5-10 years, even though these concepts have been used for a long time now within other business fields such as logistics and finance. ML and AI are extensions of Business Analytics. This paper discusses how machine learning will provide new opportunities and implications for the management accountants in the future. First, it was found that many classical areas and topics within Management Accounting and Performance Management are natural candidates for ML and AI. The true value of the paper lies in making practitioners and researchers more aware of the possibilities of ML for Management Accounting, thereby making the management accountants a real value driver for the company.
    Keywords: Management accounting, machine learning, algorithms, decisions, analytics, management accountant, business translator, performance management
    JEL: C15 M41
    Date: 2020–08–06
  4. By: Juri Hinz; Igor Grigoryev; Alexander Novikov
    Abstract: The economic viability of a mining project depends on its efficient exploration, which requires a prediction of worthwhile ore in a mine deposit. In this work, we apply the so-called LASSO methodology to estimate mineral concentration within unexplored areas. Our methodology outperforms traditional techniques not only in terms of logical consistency, but potentially also in cost reduction. Our approach is illustrated by a full source code listing and a detailed discussion of its advantages and limitations.
    Keywords: prediction; artificial intelligence; machine learning; LASSO; cross-validation
    Date: 2020–03–01
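The LASSO idea mentioned in the abstract above can be sketched in a few lines. This is a minimal illustration, not the authors' source code listing: it assumes the objective (1/(2n))·||y − Xb||² + lam·||b||₁ with no intercept, solves it by plain coordinate descent, and uses invented toy data. The names `soft_threshold`, `lasso_coordinate_descent`, `X`, `y` and `lam` are hypothetical.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0.0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Minimise (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 by coordinate descent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # squared norm of feature j
            for i in range(n):
                # Prediction from every feature except j.
                pred_without_j = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            beta[j] = soft_threshold(rho / n, lam) / (z / n) if z > 0 else 0.0
    return beta


# Hypothetical toy data (invented for illustration): three orthogonal features.
X = [[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]]
y = [2.05, 1.95, -1.95, -2.05]  # equals 2*x0 + 0.05*x1
beta = lasso_coordinate_descent(X, y, lam=0.1)
print(beta)
```

With these toy inputs the first coefficient is shrunk from 2 to about 1.9, while the weak and the irrelevant feature are set exactly to zero, which is the variable-selection behaviour that makes LASSO attractive when there are many candidate predictors.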
  5. By: Philippe Goulet Coulombe; Maxime Leroux; Dalibor Stevanovic; Stéphane Surprenant
    Abstract: From a purely predictive standpoint, rotating the predictors’ matrix in a low-dimensional linear regression setup does not alter predictions. However, when the forecasting technology either uses shrinkage or is non-linear, it does. This is precisely the fabric of the machine learning (ML) macroeconomic forecasting environment. Pre-processing of the data translates to an alteration of the regularization – explicit or implicit – embedded in ML algorithms. We review old transformations and propose new ones, then empirically evaluate their merits in a substantial pseudo-out-of-sample exercise. It is found that traditional factors should almost always be included in the feature matrix and moving average rotations of the data can provide important gains for various forecasting targets.
    Keywords: Machine Learning, Big Data, Forecasting
    Date: 2020–08–04
  6. By: Cimadomo, Jacopo; Giannone, Domenico; Lenza, Michele; Sokol, Andrej; Monti, Francesca
    Abstract: Monitoring economic conditions in real time, or nowcasting, is among the key tasks routinely performed by economists. Nowcasting entails some key challenges, which also characterise modern Big Data analytics, often referred to as the three "Vs": the large number of time series continuously released (Volume), the complexity of the data covering various sectors of the economy, published in an asynchronous way and with different frequencies and precision (Variety), and the need to incorporate new information within minutes of their release (Velocity). In this paper, we explore alternative routes to bring Bayesian Vector Autoregressive (BVAR) models up to these challenges. We find that BVARs are able to effectively handle the three Vs and produce, in real time, accurate probabilistic predictions of US economic activity and, in addition, a meaningful narrative by means of scenario analysis.
    Keywords: Big Data, business cycles, forecasting, mixed frequencies, real time, scenario analysis
    JEL: E32 E37 C01 C33 C53
    Date: 2020–08
  7. By: Linwei Hu; Jie Chen; Joel Vaughan; Hanyu Yang; Kelly Wang; Agus Sudjianto; Vijayan N. Nair
    Abstract: This article provides an overview of Supervised Machine Learning (SML) with a focus on applications to banking. The SML techniques covered include Bagging (Random Forest or RF), Boosting (Gradient Boosting Machine or GBM) and Neural Networks (NNs). We begin with an introduction to ML tasks and techniques. This is followed by a description of: i) tree-based ensemble algorithms including Bagging with RF and Boosting with GBMs, ii) Feedforward NNs, iii) a discussion of hyper-parameter optimization techniques, and iv) machine learning interpretability. The paper concludes with a comparison of the features of different ML algorithms. Examples taken from credit risk modeling in banking are used throughout the paper to illustrate the techniques and interpret the results of the algorithms.
    Date: 2020–07
  8. By: Andrii Babii; Ryan T. Ball; Eric Ghysels; Jonas Striaukas
    Abstract: This paper introduces structured machine learning regressions for prediction and nowcasting with panel data consisting of series sampled at different frequencies. Motivated by the empirical problem of predicting corporate earnings for a large cross-section of firms with macroeconomic, financial, and news time series sampled at different frequencies, we focus on the sparse-group LASSO regularization. This type of regularization can take advantage of the mixed frequency time series panel data structures and we find that it empirically outperforms the unstructured machine learning methods. We obtain oracle inequalities for the pooled and fixed effects sparse-group LASSO panel data estimators, recognizing that financial and economic data exhibit heavier than Gaussian tails. To that end, we leverage a novel Fuk-Nagaev concentration inequality for panel data consisting of heavy-tailed $\tau$-mixing processes, which may be of independent interest in other high-dimensional panel data settings.
    Date: 2020–08
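For readers unfamiliar with the sparse-group LASSO named in the abstract above, the penalty is commonly written as follows. The notation ($\beta$, $\lambda$, $\alpha$, the group partition $\mathcal{G}$) is assumed here for illustration and is not taken from the abstract; the paper's own normalisation may differ.

```latex
\min_{\beta}\; \frac{1}{n}\,\lVert y - X\beta \rVert_2^2
  \;+\; 2\lambda\,\Omega(\beta),
\qquad
\Omega(\beta) \;=\; \alpha\,\lVert \beta \rVert_1
  \;+\; (1-\alpha) \sum_{G \in \mathcal{G}} \lVert \beta_G \rVert_2,
\qquad \alpha \in [0,1].
```

Setting $\alpha = 1$ gives the ordinary LASSO and $\alpha = 0$ the group LASSO; intermediate values select whole groups (for example, all lags of one high-frequency series) while still allowing sparsity within a group.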
  9. By: von Essen, Emma (Swedish Institute for Social Research, Stockholm University); Jansson, Joakim (Department of Economics and Statistics, Linnaeus University)
    Abstract: In this paper, we quantify hateful content in online civic discussions of politics and estimate the causal link between hateful content and writer anonymity. To measure hate, we first develop a supervised machine-learning model that predicts hate against foreign residents and hate against women on a dominant Swedish Internet discussion forum. We find that an exogenous decrease in writer anonymity leads to less hate against foreign residents but an increase in hate against women. We conjecture that the mechanisms behind the changes comprise a combination of users decreasing the amount of their hateful writing and a substitution of hate against foreign residents for hate against women. The discussion of the results highlights the role of social repercussions in discouraging antisocial and criminal activities.
    Keywords: Online hate; Anonymity; Discussion forum; Machine learning; Big data
    JEL: C55 D00 D80 D90
    Date: 2020–08–14
  10. By: Omid Safarzadeh
    Abstract: This research investigates the efficiency of on-line learning algorithms for generating trading signals. I employed technical indicators based on high-frequency stock prices and generated trading signals through an ensemble of Random Forests. Similarly, a Kalman Filter was used for signaling trading positions. Comparing Time Series methods with Machine Learning methods, results spurious of Kalman Filter to Random Forests in case of on-line learning predictions of stock prices.
    Date: 2020–07
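As a rough illustration of how a Kalman filter can be turned into a trading signal, the sketch below filters prices with a scalar local-level (random walk plus noise) model. The abstract does not describe the author's state-space model or signal rule, so everything here is an assumption: the model, the variances `process_var` and `obs_var`, the "price above filtered level means long" rule, and the function names are all hypothetical.

```python
def local_level_filter(prices, process_var=0.01, obs_var=1.0, initial_var=1.0):
    """Scalar Kalman filter for a random-walk level observed with noise."""
    level = prices[0]
    var = initial_var
    estimates = []
    for price in prices:
        var += process_var              # predict: the level follows a random walk
        gain = var / (var + obs_var)    # Kalman gain, always between 0 and 1
        level += gain * (price - level) # update with the new observation
        var *= 1.0 - gain
        estimates.append(level)
    return estimates


def trading_signals(prices, estimates):
    """Hypothetical rule: +1 if price is above the filtered level, -1 if below."""
    signals = []
    for price, estimate in zip(prices, estimates):
        if price > estimate:
            signals.append(1)
        elif price < estimate:
            signals.append(-1)
        else:
            signals.append(0)
    return signals


# Invented price path, for illustration only.
prices = [10.0, 11.0, 12.0, 13.0]
estimates = local_level_filter(prices)
signals = trading_signals(prices, estimates)
print(signals)
```

Because the filter updates one observation at a time, it fits the on-line learning setting the abstract refers to: each new price revises the estimate without refitting on the full history.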
  11. By: Oz Shy
    Abstract: Using machine learning techniques applied to consumer diary survey data, the author of this working paper examines methods for studying consumer payment choice. These techniques, especially when paired with regression analyses, provide useful information for understanding and predicting the payment choices consumers make.
    Keywords: studying consumer payment choice; point of sale; statistical learning; machine learning
    JEL: C19 E42
    Date: 2020–06–23
  12. By: Jeremy Ferwerda; Nicholas Adams-Cohen; Kirk Bansak; Jennifer Fei; Duncan Lawrence; Jeremy M. Weinstein; Jens Hainmueller
    Abstract: A growing number of countries have established programs to attract immigrants who can contribute to their economy. Research suggests that an immigrant's initial arrival location plays a key role in shaping their economic success. Yet immigrants currently lack access to personalized information that would help them identify optimal destinations. Instead, they often rely on availability heuristics, which can lead to the selection of sub-optimal landing locations, lower earnings, elevated outmigration rates, and concentration in the most well-known locations. To address this issue and counteract the effects of cognitive biases and limited information, we propose a data-driven decision helper that draws on behavioral insights, administrative data, and machine learning methods to inform immigrants' location decisions. The decision helper provides personalized location recommendations that reflect immigrants' preferences as well as data-driven predictions of the locations where they maximize their expected earnings given their profile. We illustrate the potential impact of our approach using backtests conducted with administrative data that links landing data of recent economic immigrants from Canada's Express Entry system with their earnings retrieved from tax records. Simulations across various scenarios suggest that providing location recommendations to incoming economic immigrants can increase their initial earnings and lead to a mild shift away from the most populous landing destinations. Our approach can be implemented within existing institutional structures at minimal cost, and offers governments an opportunity to harness their administrative data to improve outcomes for economic immigrants.
    Date: 2020–07
  13. By: Daniel Wochner (KOF Swiss Economic Institute, ETH Zurich, Switzerland)
    Abstract: Machine Learning models are often considered to be “black boxes” that provide only little room for the incorporation of theory (cf. e.g. Mukherjee, 2017; Veltri, 2017). This article proposes so-called Dynamic Factor Trees (DFT) and Dynamic Factor Forests (DFF) for macroeconomic forecasting, which synthesize the recent machine learning, dynamic factor model and business cycle literature within a unified statistical machine learning framework for model-based recursive partitioning proposed in Zeileis, Hothorn and Hornik (2008). DFTs and DFFs are non-linear and state-dependent forecasting models, which reduce to the standard Dynamic Factor Model (DFM) as a special case and allow us to embed theory-led factor models in powerful tree-based machine learning ensembles conditional on the state of the business cycle. The out-of-sample forecasting experiment for short-term U.S. GDP growth predictions combines three distinct FRED-datasets, yielding a balanced panel with over 375 indicators from 1967 to 2018 (FRED, 2019; McCracken & Ng, 2016, 2019a, 2019b). Our results provide strong empirical evidence in favor of the proposed DFTs and DFFs and show that they significantly improve the predictive performance of DFMs by almost 20% in terms of MSFE. Interestingly, the improvements materialize in both expansionary and recessionary periods, suggesting that DFTs and DFFs tend to perform not only sporadically but systematically better than DFMs. Our findings are fairly robust to a number of sensitivity tests and hold exciting avenues for future research.
    Keywords: Forecasting, Machine Learning, Regression Trees and Forests, Dynamic Factor Model, Business Cycles, GDP Growth, United States
    JEL: C45 C51 C53 E32 O47
    Date: 2020–05
  14. By: Monica Azqueta-Gavaldon; Gonzalo Azqueta-Gavaldon; Inigo Azqueta-Gavaldon; Andres Azqueta-Gavaldon
    Abstract: This project aims at creating an investment device to help investors determine which real estate units have a higher return on investment in Madrid. To do so, we gather data from, a real estate web-page with millions of real estate units across Spain, Italy and Portugal. In this preliminary version, we present the road map on how we gather the data; give descriptive statistics of the 8,121 real estate units gathered (rental and sale); build a return index based on the difference in prices of rental and sale units (per neighbourhood and size); and introduce machine learning algorithms for rental real estate price prediction.
    Date: 2020–06
  15. By: Fantazzini, Dean
    Abstract: The ability of Google Trends data to forecast the number of new daily cases and deaths of COVID-19 is examined using a dataset of 158 countries. The analysis includes the computations of lag correlations between confirmed cases and Google data, Granger causality tests, and an out-of-sample forecasting exercise with 18 competing models with a forecast horizon of 14 days ahead. This evidence shows that Google-augmented models outperform the competing models for most of the countries. This is significant because Google data can complement epidemiological models during difficult times like the ongoing COVID-19 pandemic, when official statistics may not be fully reliable and/or may be published with a delay. Moreover, real-time tracking with online data is one of the instruments that can be used to keep the situation under control when national lockdowns are lifted and economies gradually reopen.
    Keywords: Covid-19; Google Trends; VAR; ARIMA; ARIMA-X; ETS; LASSO; SIR model
    JEL: C22 C32 C51 C53 G17 I18 I19
    Date: 2020–08
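The lag-correlation step mentioned in the abstract above can be sketched as follows. This is an illustrative assumption about how such a computation might look, not the author's procedure: it correlates a search-interest series at day t with case counts at day t + lag, using plain Pearson correlation. The series values and the names `pearson`, `lag_correlations`, `search` and `cases` are invented, and both series are assumed to have equal length and non-zero variance.

```python
from statistics import mean


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    mean_a, mean_b = mean(a), mean(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def lag_correlations(search, cases, max_lag):
    """Correlate search[t] with cases[t + lag] for lag = 0..max_lag."""
    result = {}
    for lag in range(max_lag + 1):
        leading = search[: len(search) - lag]  # drop the last `lag` days
        lagged = cases[lag:]                   # drop the first `lag` days
        result[lag] = pearson(leading, lagged)
    return result


# Invented series: cases repeat the search pattern two days later.
search = [1, 3, 2, 5, 4, 7, 6, 9, 8, 10]
cases = [0, 0, 1, 3, 2, 5, 4, 7, 6, 9]
correlations = lag_correlations(search, cases, max_lag=4)
best_lag = max(correlations, key=correlations.get)
print(best_lag, correlations)
```

In this toy example the correlation peaks at a lag of two days, which is the kind of lead that would make search data useful as an early indicator. A high lag correlation is only descriptive; the abstract's Granger causality tests and out-of-sample comparison address predictive content more formally.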
  16. By: Greyling, Talita; Rossouw, Stephanie; Adhikari, Tamanna
    Abstract: Since the start of the Covid-19 pandemic, many governments have implemented lockdown regulations to curb the spread of the virus. Though lockdowns do minimise the physical damage of the virus, there may be substantial damage to population well-being. Using a pooled dataset, this paper analyses the causal effect of mandatory lockdown on happiness in three countries (South Africa, New Zealand, and Australia) that differ widely in population size, economic development and well-being levels. Additionally, each country differs in terms of lockdown regulations and duration. The main idea is to determine, notwithstanding the characteristics of a country or the lockdown regulations, whether a lockdown negatively affects happiness. Secondly, we compare the effect size of the lockdown on happiness between these countries. We make use of Difference-in-Difference estimations to determine the causal effect of the lockdown and Least Squares Dummy Variable estimations to study the heterogeneity in the effect size of the lockdown by country. Our results show that, regardless of the characteristics of the country or the type or duration of the lockdown regulations, a lockdown causes a decline in happiness. Furthermore, the negative effect differs between countries: it seems that the more stringent the stay-at-home regulations are, the greater the negative effect.
    Keywords: Happiness, Covid-19, Big data, Difference-in-Difference
    JEL: C55 I12 I31 J18
    Date: 2020
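The Difference-in-Difference logic used in the abstract above reduces, in its simplest two-group, two-period form, to a difference of mean changes. The sketch below shows only that arithmetic; it is not the authors' regression specification, and the happiness scores, the choice of comparison group, and the name `did_estimate` are invented for illustration.

```python
from statistics import mean


def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """Two-by-two difference-in-differences computed on group means."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    # The control change proxies what would have happened without treatment.
    return treated_change - control_change


# Hypothetical happiness scores on a 0-10 scale, invented for illustration.
treated_pre = [7.0, 7.2, 6.8]    # lockdown group, before lockdown
treated_post = [6.0, 6.4, 6.2]   # lockdown group, during lockdown
control_pre = [7.1, 6.9, 7.0]    # comparison group, same early window
control_post = [6.9, 7.0, 6.8]   # comparison group, same later window

effect = did_estimate(treated_pre, treated_post, control_pre, control_post)
print(round(effect, 2))
```

Here the treated group falls by 0.8 points and the comparison group by 0.1, so the estimated effect is about −0.7. The estimate has a causal reading only under the parallel-trends assumption, namely that both groups would have moved alike in the absence of a lockdown.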
  17. By: Xue-Zhong He (Finance Discipline Group, UTS Business School, University of Technology Sydney); Shen Lin
    Abstract: Information-based reinforcement learning is effective for trading and price discovery in limit order markets. It helps traders to learn a statistical equilibrium in which traders' expected payoffs and out-of-sample payoffs are highly correlated. Consistent with rational equilibrium models, the order choice between buy and sell and between market and limit orders for informed traders mainly depends on their information about fundamental value, while uninformed traders trade on a short-run momentum of the informed market orders. The learning increases the liquidity supply of uninformed traders and the liquidity consumption of informed traders, generating a diagonal effect on order submission and hump-shaped order books, and improving traders' profitability and price discovery. The results shed light on the market practice of using machine learning in limit order markets.
    Keywords: Reinforcement Learning; Order Book Information; Limit Orders; Momentum Trading
    JEL: G14 C63 D82 D83
    Date: 2019–02–01

This nep-big issue is ©2020 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at For comments please write to the director of NEP, Marco Novarese at <>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.