nep-big 2020-08-31 papers

on Big Data

Issue of 2020‒08‒31
seventeen papers chosen by
Tom Coupé
University of Canterbury

Crowd, Lending, Machine, and Bias By Runshan Fu; Yan Huang; Param Vir Singh
The Effect of COVID-19 Lockdown on Mobility and Traffic Accidents: Evidence from Louisiana By Shafiullah Qureshi; Ba M Chu; Fanny S. Demers
Management accounting and the idea of machine learning By Steen Nielsen
An Application of High-Dimensional Statistics to Predictive Modeling of Grade Variability By Juri Hinz; Igor Grigoryev; Alexander Novikov
Macroeconomic Data Transformations Matter By Philippe Goulet Coulombe; Maxime Leroux; Dalibor Stevanovic; Stéphane Surprenant
Nowcasting with large Bayesian vector autoregressions By Cimadomo, Jacopo; Giannone, Domenico; Lenza, Michele; Sokol, Andrej; Monti, Francesca
Supervised Machine Learning Techniques: An Overview with Applications to Banking By Linwei Hu; Jie Chen; Joel Vaughan; Hanyu Yang; Kelly Wang; Agus Sudjianto; Vijayan N. Nair
Machine Learning Panel Data Regressions with an Application to Nowcasting Price Earnings Ratios By Andrii Babii; Ryan T. Ball; Eric Ghysels; Jonas Striaukas
Misogynistic and Xenophobic Hate Language Online: A Matter of Anonymity By von Essen, Emma; Jansson, Joakim
Generating Trading Signals by ML algorithms or time series ones? By Omid Safarzadeh
Alternative Methods for Studying Consumer Payment Choice By Oz Shy
Leveraging the Power of Place: A Data-Driven Decision Helper to Improve the Location Decisions of Economic Immigrants By Jeremy Ferwerda; Nicholas Adams-Cohen; Kirk Bansak; Jennifer Fei; Duncan Lawrence; Jeremy M. Weinstein; Jens Hainmueller
Dynamic Factor Trees and Forests – A Theory-led Machine Learning Framework for Non-Linear and State-Dependent Short-Term U.S. GDP Growth Predictions By Daniel Wochner
Developing a real estate yield investment device using granular data and machine learning By Monica Azqueta-Gavaldon; Gonzalo Azqueta-Gavaldon; Inigo Azqueta-Gavaldon; Andres Azqueta-Gavaldon
Short-term forecasting of the COVID-19 pandemic using Google Trends data: Evidence from 158 countries By Fantazzini, Dean
A tale of three countries: How did Covid-19 lockdown impact happiness? By Greyling, Talita; Rossouw, Stephanie; Adhikari, Tamanna
Reinforcement Learning in Limit Order Markets By Xue-Zhong He; Shen Lin

Crowd, Lending, Machine, and Bias

By:	Runshan Fu; Yan Huang; Param Vir Singh
Abstract:	Big data and machine learning (ML) algorithms are key drivers of many fintech innovations. While it may be obvious that replacing humans with machines would increase efficiency, it is not clear whether and where machines can make better decisions than humans. We answer this question in the context of crowd lending, where decisions are traditionally made by a crowd of investors. Using data from Prosper.com, we show that a reasonably sophisticated ML algorithm predicts listing default probability more accurately than crowd investors. The dominance of the machine over the crowd is more pronounced for highly risky listings. We then use the machine to make investment decisions, and find that the machine benefits not only the lenders but also the borrowers. When machine prediction is used to select loans, it leads to a higher rate of return for investors and more funding opportunities for borrowers with few alternative funding options. We also find suggestive evidence that the machine is biased in gender and race even when it does not use gender and race information as input. We propose a general and effective "debiasing" method that can be applied to any prediction-focused ML application, and demonstrate its use in our context. We show that the debiased ML algorithm, which suffers from lower prediction accuracy, still leads to better investment decisions compared with the crowd. These results indicate that ML can help crowd lending platforms better fulfill the promise of providing access to financial resources to otherwise underserved individuals and ensure fairness in the allocation of these resources.
Date:	2020–07
URL:	http://d.repec.org/n?u=RePEc:arx:papers:2008.04068&r=all

The Effect of COVID-19 Lockdown on Mobility and Traffic Accidents: Evidence from Louisiana

By:	Shafiullah Qureshi (Department of Economics, Carleton University); Ba M Chu (Department of Economics, Carleton University); Fanny S. Demers (Department of Economics, Carleton University)
Abstract:	The objective of this paper is to apply state-of-the-art machine-learning (ML) algorithms to predict the monthly and quarterly real GDP growth of Canada using both Google Trends (GT) and official data that are available ahead of the release of GDP data by Statistics Canada. This paper applies a novel approach for selecting features with XGBoost using the AutoML function of H2O. For this purpose, 5000 to 15000 XGBoost models are trained using this function. We use a very rigorous variable selection procedure, where only the best features are selected into the next stage to build a final learning model. Then pertinent features are introduced into the XGBoost model for forecasting real GDP growth rate. The forecasts are further improved by using Principal Component Analysis to choose the best factors out of the predictors selected by the XGBoost model. The results indicate that there are gains in nowcasting accuracy from using the XGBoost model with this two-step strategy. We first find that XGBoost is a superior forecasting model relative to our baseline models using alternative forecasting procedures such as AR(1). We also find that Google Trends data combined with the XGBoost model provide a very viable source of information for predicting Canadian real GDP growth when official data are not yet available due to publication lags. Thus, we can forecast real GDP growth rate accurately ahead of the release of official data. Moreover, we apply various techniques to make the machine learning model more interpretable.
Date:	2020–08
URL:	http://d.repec.org/n?u=RePEc:car:carecp:20-14&r=all

Management accounting and the idea of machine learning

By:	Steen Nielsen (Department of Economics and Business Economics, Aarhus University)
Abstract:	Not only is the role of data changing in a most dramatic way, but also the way we can handle and use the data through a number of new technologies such as Machine Learning (ML) and Artificial Intelligence (AI). The changes, their speed and scale, as well as their impact on almost every aspect of daily life and, of course, on Management Accounting are almost unbelievable. The term ‘data’ in this context means business data in the broadest possible sense. ML teaches computers to do what comes naturally to humans and decision makers: that is to learn from experience. ML and AI for management accountants have only been sporadically discussed within the last 5-10 years, even though these concepts have been used for a long time now within other business fields such as logistics and finance. ML and AI are extensions of Business Analytics. This paper discusses how machine learning will provide new opportunities and implications for the management accountants in the future. First, it was found that many classical areas and topics within Management Accounting and Performance Management are natural candidates for ML and AI. The true value of the paper lies in making practitioners and researchers more aware of the possibilities of ML for Management Accounting, thereby making the management accountants a real value driver for the company.
Keywords:	Management accounting, machine learning, algorithms, decisions, analytics, management accountant, business translator, performance management
JEL:	C15 M41
Date:	2020–08–06
URL:	http://d.repec.org/n?u=RePEc:aah:aarhec:2020-09&r=all

An Application of High-Dimensional Statistics to Predictive Modeling of Grade Variability

By:	Juri Hinz; Igor Grigoryev; Alexander Novikov
Abstract:	The economic viability of a mining project depends on its efficient exploration, which requires a prediction of worthwhile ore in a mine deposit. In this work, we apply the so-called LASSO methodology to estimate mineral concentration within unexplored areas. Our methodology outperforms traditional techniques not only in terms of logical consistency, but potentially also in cost reduction. The approach is illustrated by a full source code listing and a detailed discussion of its advantages and limitations.
Keywords:	prediction; artificial intelligence; machine learning; LASSO; cross-validation
Date:	2020–03–01
URL:	http://d.repec.org/n?u=RePEc:uts:rpaper:407&r=all

Macroeconomic Data Transformations Matter

By:	Philippe Goulet Coulombe; Maxime Leroux; Dalibor Stevanovic; Stéphane Surprenant
Abstract:	From a purely predictive standpoint, rotating the predictors’ matrix in a low-dimensional linear regression setup does not alter predictions. However, when the forecasting technology either uses shrinkage or is non-linear, it does. This is precisely the fabric of the machine learning (ML) macroeconomic forecasting environment. Pre-processing of the data translates to an alteration of the regularization – explicit or implicit – embedded in ML algorithms. We review old transformations and propose new ones, then empirically evaluate their merits in a substantial pseudo-out-of-sample exercise. It is found that traditional factors should almost always be included in the feature matrix and moving average rotations of the data can provide important gains for various forecasting targets.
Keywords:	Machine Learning, Big Data, Forecasting
Date:	2020–08–04
URL:	http://d.repec.org/n?u=RePEc:cir:cirwor:2020s-42&r=all

Nowcasting with large Bayesian vector autoregressions

By:	Cimadomo, Jacopo; Giannone, Domenico; Lenza, Michele; Sokol, Andrej; Monti, Francesca
Abstract:	Monitoring economic conditions in real time, or nowcasting, is among the key tasks routinely performed by economists. Nowcasting entails some key challenges, which also characterise modern Big Data analytics, often referred to as the three “Vs”: the large number of time series continuously released (Volume), the complexity of the data covering various sectors of the economy, published in an asynchronous way and with different frequencies and precision (Variety), and the need to incorporate new information within minutes of their release (Velocity). In this paper, we explore alternative routes to bring Bayesian Vector Autoregressive (BVAR) models up to these challenges. We find that BVARs are able to effectively handle the three Vs and produce, in real time, accurate probabilistic predictions of US economic activity and, in addition, a meaningful narrative by means of scenario analysis.
Keywords:	Big Data, business cycles, forecasting, mixed frequencies, real time, scenario analysis
JEL:	E32 E37 C01 C33 C53
Date:	2020–08
URL:	http://d.repec.org/n?u=RePEc:ecb:ecbwps:20202453&r=all

Supervised Machine Learning Techniques: An Overview with Applications to Banking

By:	Linwei Hu; Jie Chen; Joel Vaughan; Hanyu Yang; Kelly Wang; Agus Sudjianto; Vijayan N. Nair
Abstract:	This article provides an overview of Supervised Machine Learning (SML) with a focus on applications to banking. The SML techniques covered include Bagging (Random Forest or RF), Boosting (Gradient Boosting Machine or GBM) and Neural Networks (NNs). We begin with an introduction to ML tasks and techniques. This is followed by a description of: i) tree-based ensemble algorithms including Bagging with RF and Boosting with GBMs, ii) Feedforward NNs, iii) a discussion of hyper-parameter optimization techniques, and iv) machine learning interpretability. The paper concludes with a comparison of the features of different ML algorithms. Examples taken from credit risk modeling in banking are used throughout the paper to illustrate the techniques and interpret the results of the algorithms.
Date:	2020–07
URL:	http://d.repec.org/n?u=RePEc:arx:papers:2008.04059&r=all

Machine Learning Panel Data Regressions with an Application to Nowcasting Price Earnings Ratios

By:	Andrii Babii; Ryan T. Ball; Eric Ghysels; Jonas Striaukas
Abstract:	This paper introduces structured machine learning regressions for prediction and nowcasting with panel data consisting of series sampled at different frequencies. Motivated by the empirical problem of predicting corporate earnings for a large cross-section of firms with macroeconomic, financial, and news time series sampled at different frequencies, we focus on the sparse-group LASSO regularization. This type of regularization can take advantage of the mixed frequency time series panel data structures and we find that it empirically outperforms the unstructured machine learning methods. We obtain oracle inequalities for the pooled and fixed effects sparse-group LASSO panel data estimators, recognizing that financial and economic data exhibit heavier than Gaussian tails. To that end, we leverage a novel Fuk-Nagaev concentration inequality for panel data consisting of heavy-tailed $\tau$-mixing processes, which may be of independent interest in other high-dimensional panel data settings.
Date:	2020–08
URL:	http://d.repec.org/n?u=RePEc:arx:papers:2008.03600&r=all

Misogynistic and Xenophobic Hate Language Online: A Matter of Anonymity

By:	von Essen, Emma (Swedish Institute for Social Research, Stockholm University); Jansson, Joakim (Department of Economics and Statistics, Linnaeus University)
Abstract:	In this paper, we quantify hateful content in online civic discussions of politics and estimate the causal link between hateful content and writer anonymity. To measure hate, we first develop a supervised machine-learning model that predicts hate against foreign residents and hate against women on a dominant Swedish Internet discussion forum. We find that an exogenous decrease in writer anonymity leads to less hate against foreign residents but an increase in hate against women. We conjecture that the mechanisms behind the changes comprise a combination of users decreasing the amount of their hateful writing and a substitution of hate against foreign residents for hate against women. The discussion of the results highlights the role of social repercussions in discouraging antisocial and criminal activities.
Keywords:	Online hate; Anonymity; Discussion forum; Machine learning; Big data
JEL:	C55 D00 D80 D90
Date:	2020–08–14
URL:	http://d.repec.org/n?u=RePEc:hhs:iuiwop:1350&r=all

Generating Trading Signals by ML algorithms or time series ones?

By:	Omid Safarzadeh
Abstract:	This research investigates the efficiency of on-line learning algorithms to generate trading signals. I employed technical indicators based on high frequency stock prices and generated trading signals through an ensemble of Random Forests. Similarly, a Kalman Filter was used for signaling trading positions. Comparing Time Series methods with Machine Learning methods, results spurious of Kalman Filter to Random Forests in case of on-line learning predictions of stock prices.
Date:	2020–07
URL:	http://d.repec.org/n?u=RePEc:arx:papers:2007.11098&r=all

Alternative Methods for Studying Consumer Payment Choice

By:	Oz Shy
Abstract:	Using machine learning techniques applied to consumer diary survey data, the author of this working paper examines methods for studying consumer payment choice. These techniques, especially when paired with regression analyses, provide useful information for understanding and predicting the payment choices consumers make.
Keywords:	studying consumer payment choice; point of sale; statistical learning; machine learning
JEL:	C19 E42
Date:	2020–06–23
URL:	http://d.repec.org/n?u=RePEc:fip:fedawp:88473&r=all

Leveraging the Power of Place: A Data-Driven Decision Helper to Improve the Location Decisions of Economic Immigrants

By:	Jeremy Ferwerda; Nicholas Adams-Cohen; Kirk Bansak; Jennifer Fei; Duncan Lawrence; Jeremy M. Weinstein; Jens Hainmueller
Abstract:	A growing number of countries have established programs to attract immigrants who can contribute to their economy. Research suggests that an immigrant's initial arrival location plays a key role in shaping their economic success. Yet immigrants currently lack access to personalized information that would help them identify optimal destinations. Instead, they often rely on availability heuristics, which can lead to the selection of sub-optimal landing locations, lower earnings, elevated outmigration rates, and concentration in the most well-known locations. To address this issue and counteract the effects of cognitive biases and limited information, we propose a data-driven decision helper that draws on behavioral insights, administrative data, and machine learning methods to inform immigrants' location decisions. The decision helper provides personalized location recommendations that reflect immigrants' preferences as well as data-driven predictions of the locations where they maximize their expected earnings given their profile. We illustrate the potential impact of our approach using backtests conducted with administrative data that links landing data of recent economic immigrants from Canada's Express Entry system with their earnings retrieved from tax records. Simulations across various scenarios suggest that providing location recommendations to incoming economic immigrants can increase their initial earnings and lead to a mild shift away from the most populous landing destinations. Our approach can be implemented within existing institutional structures at minimal cost, and offers governments an opportunity to harness their administrative data to improve outcomes for economic immigrants.
Date:	2020–07
URL:	http://d.repec.org/n?u=RePEc:arx:papers:2007.13902&r=all

Dynamic Factor Trees and Forests – A Theory-led Machine Learning Framework for Non-Linear and State-Dependent Short-Term U.S. GDP Growth Predictions

By:	Daniel Wochner (KOF Swiss Economic Institute, ETH Zurich, Switzerland)
Abstract:	Machine Learning models are often considered to be “black boxes” that provide only little room for the incorporation of theory (cf. e.g. Mukherjee, 2017; Veltri, 2017). This article proposes so-called Dynamic Factor Trees (DFT) and Dynamic Factor Forests (DFF) for macroeconomic forecasting, which synthesize the recent machine learning, dynamic factor model and business cycle literature within a unified statistical machine learning framework for model-based recursive partitioning proposed in Zeileis, Hothorn and Hornik (2008). DFTs and DFFs are non-linear and state-dependent forecasting models, which reduce to the standard Dynamic Factor Model (DFM) as a special case and allow us to embed theory-led factor models in powerful tree-based machine learning ensembles conditional on the state of the business cycle. The out-of-sample forecasting experiment for short-term U.S. GDP growth predictions combines three distinct FRED-datasets, yielding a balanced panel with over 375 indicators from 1967 to 2018 (FRED, 2019; McCracken & Ng, 2016, 2019a, 2019b). Our results provide strong empirical evidence in favor of the proposed DFTs and DFFs and show that they significantly improve the predictive performance of DFMs by almost 20% in terms of MSFE. Interestingly, the improvements materialize in both expansionary and recessionary periods, suggesting that DFTs and DFFs tend to perform not only sporadically but systematically better than DFMs. Our findings are fairly robust to a number of sensitivity tests and hold exciting avenues for future research.
Keywords:	Forecasting, Machine Learning, Regression Trees and Forests, Dynamic Factor Model, Business Cycles, GDP Growth, United States
JEL:	C45 C51 C53 E32 O47
Date:	2020–05
URL:	http://d.repec.org/n?u=RePEc:kof:wpskof:20-472&r=all

Developing a real estate yield investment device using granular data and machine learning

By:	Monica Azqueta-Gavaldon; Gonzalo Azqueta-Gavaldon; Inigo Azqueta-Gavaldon; Andres Azqueta-Gavaldon
Abstract:	This project aims at creating an investment device to help investors determine which real estate units have a higher return on investment in Madrid. To do so, we gather data from Idealista.com, a real estate web-page with millions of real estate units across Spain, Italy and Portugal. In this preliminary version, we present the road map for how we gather the data and descriptive statistics of the 8,121 real estate units gathered (rental and sale), build a return index based on the difference in prices of rental and sale units (per neighbourhood and size), and introduce machine learning algorithms for rental real estate price prediction.
Date:	2020–06
URL:	http://d.repec.org/n?u=RePEc:arx:papers:2008.02629&r=all

Short-term forecasting of the COVID-19 pandemic using Google Trends data: Evidence from 158 countries

By:	Fantazzini, Dean
Abstract:	The ability of Google Trends data to forecast the number of new daily cases and deaths of COVID-19 is examined using a dataset of 158 countries. The analysis includes the computations of lag correlations between confirmed cases and Google data, Granger causality tests, and an out-of-sample forecasting exercise with 18 competing models with a forecast horizon of 14 days ahead. This evidence shows that Google-augmented models outperform the competing models for most of the countries. This is significant because Google data can complement epidemiological models during difficult times like the ongoing COVID-19 pandemic, when official statistics may not be fully reliable and/or may be published with a delay. Moreover, real-time tracking with online data is one of the instruments that can be used to keep the situation under control when national lockdowns are lifted and economies gradually reopen.
Keywords:	Covid-19; Google Trends; VAR; ARIMA; ARIMA-X; ETS; LASSO; SIR model
JEL:	C22 C32 C51 C53 G17 I18 I19
Date:	2020–08
URL:	http://d.repec.org/n?u=RePEc:pra:mprapa:102315&r=all

A tale of three countries: How did Covid-19 lockdown impact happiness?

By:	Greyling, Talita; Rossouw, Stephanie; Adhikari, Tamanna
Abstract:	Since the start of the Covid-19 pandemic, many governments have implemented lockdown regulations to curb the spread of the virus. Though lockdowns do minimise the physical damage of the virus, there may be substantial damage to population well-being. Using a pooled dataset, this paper analyses the causal effect of mandatory lockdown on happiness in three countries (South Africa, New Zealand, and Australia) that differ widely in population size, economic development and well-being levels. Additionally, each country differs in terms of lockdown regulations and duration. The main idea is to determine, notwithstanding the characteristics of a country or the lockdown regulations, whether a lockdown negatively affects happiness. Secondly, we compare the effect size of the lockdown on happiness between these countries. We make use of Difference-in-Difference estimations to determine the causal effect of the lockdown and Least Squares Dummy Variable estimations to study the heterogeneity in the effect size of the lockdown by country. Our results show that, regardless of the characteristics of the country or the type or duration of the lockdown regulations, a lockdown causes a decline in happiness. Furthermore, the negative effect differs between countries: the more stringent the stay-at-home regulations are, the greater the negative effect appears to be.
Keywords:	Happiness, Covid-19, Big data, Difference-in-Difference
JEL:	C55 I12 I31 J18
Date:	2020
URL:	http://d.repec.org/n?u=RePEc:zbw:glodps:584&r=all

Reinforcement Learning in Limit Order Markets

By:	Xue-Zhong He (Finance Discipline Group, UTS Business School, University of Technology Sydney); Shen Lin
Abstract:	Information-based reinforcement learning is effective for trading and price discovery in limit order markets. It helps traders to learn a statistical equilibrium in which traders' expected payoffs and out-of-sample payoffs are highly correlated. Consistent with rational equilibrium models, the order choice between buy and sell and between market and limit orders for informed traders mainly depends on their information about fundamental value, while uninformed traders trade on a short-run momentum of the informed market orders. The learning increases the liquidity supply of uninformed traders and the liquidity consumption of informed traders, generating a diagonal effect on order submission and hump-shaped order books, and improving traders' profitability and price discovery. The results shed light on the market practice of using machine learning in limit order markets.
Keywords:	Reinforcement Learning; Order Book Information; Limit Orders; Momentum Trading
JEL:	G14 C63 D82 D83
Date:	2019–02–01
URL:	http://d.repec.org/n?u=RePEc:uts:rpaper:403&r=all

This nep-big issue is ©2020 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.

General information on the NEP project can be found at http://nep.repec.org. For comments please write to the director of NEP, Marco Novarese at <director@nep.repec.org>. Put “NEP” in the subject, otherwise your mail may be rejected.

NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.