nep-big New Economics Papers
on Big Data
Issue of 2020‒06‒22
twenty-one papers chosen by
Tom Coupé
University of Canterbury

  1. Econometric Methods and Data Science Techniques: A Review of Two Strands of Literature and an Introduction to Hybrid Methods By Xie, Tian; Yu, Jun; Zeng, Tao
  2. Machine Learning in Gravity Models: An Application to Agricultural Trade By Munisamy Gopinath; Feras A. Batarseh; Jayson Beckman
  3. The best way to select features? By Xin Man; Ernest Chan
  4. Computing platforms for big data analytics and artificial intelligence By Giuseppe Bruno; Hiren Jani; Rafael Schmidt; Bruno Tissot
  5. Machine Learning for Zombie Hunting. Firms Failures and Financial Constraints. By Falco J. Bargagli-Stoffi; Massimo Riccaboni; Armando Rungi
  6. Financial option valuation by unsupervised learning with artificial neural networks By Beatriz Salvador; Cornelis W. Oosterlee; Remco van der Meer
  7. Financial frictions and the wealth distribution By Jesús Fernández-Villaverde; Samuel Hurtado; Galo Nuño
  8. The Role of Investor Sentiment in Forecasting Housing Returns in China: A Machine Learning Approach By Oguzhan Cepni; Rangan Gupta; Yigit Onay
  9. The Allocation of Decision Authority to Human and Artificial Intelligence By Athey, Susan; Bryan, Kevin; Gans, Joshua S.
  10. Sig-SDEs model for quantitative finance By Imanol Perez Arribas; Cristopher Salvi; Lukasz Szpruch
  11. Measuring Commuting and Economic Activity inside Cities with Cell Phone Records By Gabriel E. Kreindler; Yuhei Miyauchi
  12. Using Network Interbank Contagion in Bank Default Prediction By Riccardo Doyle
  13. How Boltzmann Entropy Improves Prediction with LSTM By Grilli, Luca; Santoro, Domenico
  14. BIG DATA e a Informação Pública na Tomada de Decisão By PALETTA, FRANCISCO CARLOS; da Conceição, Rafael Sena
  15. The Rise of Fintech: A Cross-Country Perspective By Aurore Oskar KOWALEWSKI; Paweł PISANY
  16. Planes, Trains, and Automobiles: Night-time Lights of the USA By Dickinson, Jeffrey
  17. Theoretical Guarantees for Learning Conditional Expectation using Controlled ODE-RNN By Calypso Herrera; Florian Krach; Josef Teichmann
  18. Earnings Prediction with Deep Learning By Lars Elend; Sebastian A. Tideman; Kerstin Lopatta; Oliver Kramer
  19. Firm-Level Exposure to Epidemic Diseases: Covid-19, SARS, and H1N1 By Hassan, Tarek Alexander; Hollander, Stephan; Tahoun, Ahmed; van Lent, Laurence
  20. Influence via Ethos: On the Persuasive Power of Reputation in Deliberation Online By Emaad Manzoor; George H. Chen; Dokyun Lee; Michael D. Smith
  21. Text-mining IMF country reports - an original dataset By Mihalyi, David; Mate, Akos

  1. By: Xie, Tian (Shanghai University of Finance and Economics); Yu, Jun (School of Economics, Singapore Management University); Zeng, Tao (Zhejiang University)
    Abstract: The data market has been growing at an exceptional pace. Consequently, more sophisticated strategies to conduct economic forecasts have been introduced with machine learning techniques. Does machine learning pose a threat to conventional econometric methods in terms of forecasting? Moreover, does machine learning present great opportunities to cross-fertilize the field of econometric forecasting? In this report, we develop a pedagogical framework that identifies complementarity and bridges between the two strands of literature. Existing econometric methods and machine learning techniques for economic forecasting are reviewed and compared. The advantages and disadvantages of these two classes of methods are discussed. A class of hybrid methods that combine conventional econometrics and machine learning is introduced. New directions for integrating the two are suggested. The out-of-sample performance of the alternatives is compared when they are employed to forecast the Chicago Board Options Exchange Volatility Index and the harmonized index of consumer prices for the euro area. In the first exercise, econometric methods seem to work better, whereas machine learning methods generally dominate in the second empirical application.
    Date: 2020–05–30
  2. By: Munisamy Gopinath; Feras A. Batarseh; Jayson Beckman
    Abstract: Predicting agricultural trade patterns is critical to decision making in the public and private domains, especially in the current context of trade disputes among major economies. Focusing on seven major agricultural commodities with a long history of trade, this study employed data-driven and deep-learning processes, namely supervised and unsupervised machine learning (ML) techniques, to decipher patterns of trade. The supervised (unsupervised) ML techniques were trained on data until 2010 (2014), and projections were made for 2011-2016 (2014-2020). Results show the high relevance of ML models to predicting trade patterns in the near and long term relative to traditional approaches, which are often subjective assessments or time-series projections. While supervised ML techniques quantified key economic factors underlying agricultural trade flows, unsupervised approaches provide better fits over the long term.
    JEL: C45 F14 Q17
    Date: 2020–05
  3. By: Xin Man; Ernest Chan
    Abstract: Feature selection in machine learning is subject to the intrinsic randomness of the feature selection algorithms (for example, random permutations during MDA). Stability of selected features with respect to such randomness is essential to the human interpretability of a machine learning algorithm. We propose a rank-based stability metric, called the instability index, to compare the stabilities of three feature selection algorithms (MDA, LIME, and SHAP) as applied to random forests. Typically, features are selected by averaging many random iterations of a selection algorithm. Though we find that the variability of the selected features does decrease as the number of iterations increases, it does not go to zero, and the features selected by the three algorithms do not necessarily converge to the same set. We find LIME and SHAP to be more stable than MDA, and LIME is at least as stable as SHAP for the top-ranked features. Hence, overall, LIME is best suited for human interpretability. However, the selected set of features from all three algorithms significantly improves various predictive metrics out of sample, and their predictive performances do not differ significantly. Experiments were conducted on synthetic datasets, two public benchmark datasets, and on proprietary data from an active investment strategy.
    Date: 2020–05
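    To make the idea of rank stability concrete, the sketch below scores how much feature ranks move across repeated runs of a selection algorithm. It is only an illustration: the abstract does not define the instability index, so the measure used here (the average standard deviation of each feature's rank across runs) and the names `ranks` and `rank_instability` are assumptions, not the authors' metric.

```python
from statistics import pstdev


def ranks(importances):
    """Rank features by importance (1 = most important); ties keep index order."""
    order = sorted(range(len(importances)), key=lambda i: -importances[i])
    result = [0] * len(importances)
    for position, feature in enumerate(order, start=1):
        result[feature] = position
    return result


def rank_instability(runs):
    """Hypothetical instability score: mean over features of the standard
    deviation of that feature's rank across runs (0.0 = perfectly stable)."""
    rank_rows = [ranks(run) for run in runs]
    n_features = len(rank_rows[0])
    per_feature = [
        pstdev([row[f] for row in rank_rows]) for f in range(n_features)
    ]
    return sum(per_feature) / n_features


# Two runs that agree on the ordering, and two runs that reverse it.
stable_runs = [[0.5, 0.3, 0.2], [0.6, 0.3, 0.1]]
unstable_runs = [[0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
stable_score = rank_instability(stable_runs)
unstable_score = rank_instability(unstable_runs)
```

    A score of zero means every run produced the same ordering; larger values mean the ordering depends more on the algorithm's internal randomness.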
  4. By: Giuseppe Bruno; Hiren Jani; Rafael Schmidt; Bruno Tissot
    Date: 2020–04–24
  5. By: Falco J. Bargagli-Stoffi (IMT School for advanced studies); Massimo Riccaboni (IMT School for advanced studies); Armando Rungi (IMT School for advanced studies)
    Abstract: In this contribution, we exploit machine learning techniques to predict the risk of failure of firms. Then, we propose an empirical definition of zombies as firms that persist in a status of high risk, beyond the highest decile, after which we observe that the chances to transit to lower risk are minimal. We implement a Bayesian Additive Regression Tree with Missing Incorporated in Attributes (BART-MIA), which is specifically useful in our setting as we provide evidence that patterns of undisclosed accounts correlate with firm failures. After training our algorithm on 304,906 firms active in Italy in the period 2008-2017, we show how it outperforms proxy models like the Z-scores and the Distance-to-Default, traditional econometric methods, and other widely used machine learning techniques. We document that zombies are on average 21% less productive, 76% smaller, and they increased in times of financial crisis. In general, we argue that our application helps in the design of evidence-based policies in the presence of market failures, for example optimal bankruptcy laws. We believe our framework can help to inform the design of support programs for highly distressed firms after the recent pandemic crisis.
    Keywords: machine learning; Bayesian statistical learning; financial constraints; bankruptcy; zombie firms
    JEL: C53 C55 G32 G33 L21 L25
    Date: 2020–06
  6. By: Beatriz Salvador; Cornelis W. Oosterlee; Remco van der Meer
    Abstract: Artificial neural networks (ANNs) have recently also been applied to solve partial differential equations (PDEs). In this work, the classical problem of pricing European and American financial options, based on the corresponding PDE formulations, is studied. Instead of using numerical techniques based on finite element or difference methods, we address the problem using ANNs in the context of unsupervised learning. As a result, the ANN learns the option values for all possible underlying stock values at future time points, based on the minimization of a suitable loss function. For the European option, we solve the linear Black-Scholes equation, whereas for the American option, we solve the linear complementarity problem formulation. Two-asset exotic option values are also computed, since ANNs enable the accurate valuation of high-dimensional options. The resulting errors of the ANN approach are assessed by comparing to the analytic option values or to numerical reference solutions (for American options, computed by finite elements).
    Date: 2020–05
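    The abstract says the ANN errors are assessed against analytic option values. For a European option under Black-Scholes, that reference is the standard closed-form price, sketched below with the standard library only. The function names and the sample parameters are illustrative assumptions and are not taken from the paper.

```python
import math


def norm_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def _d1_d2(s, k, r, sigma, t):
    """Black-Scholes d1 and d2 terms for spot s, strike k, rate r,
    volatility sigma and time to maturity t (in years)."""
    d1 = (math.log(s / k) + (r + 0.5 * sigma ** 2) * t) / (sigma * math.sqrt(t))
    return d1, d1 - sigma * math.sqrt(t)


def bs_call(s, k, r, sigma, t):
    """Closed-form price of a European call on a non-dividend-paying asset."""
    d1, d2 = _d1_d2(s, k, r, sigma, t)
    return s * norm_cdf(d1) - k * math.exp(-r * t) * norm_cdf(d2)


def bs_put(s, k, r, sigma, t):
    """Closed-form price of a European put on a non-dividend-paying asset."""
    d1, d2 = _d1_d2(s, k, r, sigma, t)
    return k * math.exp(-r * t) * norm_cdf(-d2) - s * norm_cdf(-d1)


# Illustrative parameters (assumed, not from the paper).
call_price = bs_call(100.0, 100.0, 0.05, 0.2, 1.0)
put_price = bs_put(100.0, 100.0, 0.05, 0.2, 1.0)
```

    An approximate solver, whether a neural network or a finite-element scheme, can be checked pointwise against these values; put-call parity offers a second consistency check.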
  7. By: Jesús Fernández-Villaverde (University of Pennsylvania, NBER and CEPR); Samuel Hurtado (Banco de España); Galo Nuño (Banco de España)
    Abstract: We postulate a nonlinear DSGE model with a financial sector and heterogeneous households. In our model, the interaction between the supply of bonds by the financial sector and the precautionary demand for bonds by households produces significant endogenous aggregate risk. This risk induces an endogenous regime-switching process for output, the risk-free rate, excess returns, debt, and leverage. The regime-switching generates i) multimodal distributions of the variables above; ii) time-varying levels of volatility and skewness for the same variables; and iii) supercycles of borrowing and deleveraging. All of these are important properties of the data. In comparison, the representative household version of the model cannot generate any of these features. Methodologically, we discuss how nonlinear DSGE models with heterogeneous agents can be efficiently computed using machine learning and how they can be estimated with a likelihood function, using inference with diffusions.
    Keywords: heterogeneous agents, wealth distribution, financial frictions, continuous-time, machine learning, neural networks, structural estimation, likelihood function
    JEL: C45 C63 E32 E44 G01 G11
    Date: 2020–06
  8. By: Oguzhan Cepni (Central Bank of the Republic of Turkey, Haci Bayram Mah. Istiklal Cad. No:10 06050, Ankara, Turkey); Rangan Gupta (Department of Economics, University of Pretoria, Pretoria, 0002, South Africa); Yigit Onay (Central Bank of the Republic of Turkey, Haci Bayram Mah. Istiklal Cad. No:10 06050, Ankara, Turkey)
    Abstract: This paper analyzes the predictive ability of aggregate and dis-aggregate proxies of investor sentiment, over and above standard macroeconomic predictors, in forecasting housing returns in China, using an array of machine learning models. Using a monthly out-of-sample period of 2011:01 to 2018:12, given an in-sample of 2006:01-2010:12, we find that the new aligned investor sentiment index proposed in this paper has greater predictive power for housing returns than the principal component analysis (PCA)-based sentiment index used earlier in the literature. Moreover, shrinkage models utilizing the dis-aggregate sentiment proxies do not result in forecast improvement, indicating that the aligned sentiment index optimally exploits information in the dis-aggregate proxies of investor sentiment. Furthermore, when we let the machine learning models choose from all key control variables and the aligned sentiment index, the forecasting accuracy is improved at all forecasting horizons, rather than just the short run as witnessed under standard predictive regressions. This result suggests that machine learning methods are flexible enough to capture both structural change and time-varying information in a set of predictors simultaneously to forecast housing returns of China in a precise manner. Given the role of the real estate market in China’s economic growth, our result of accurate forecasting of housing returns, based on investor sentiment and macroeconomic variables using state-of-the-art machine learning methods, has important implications for both investors and policymakers.
    Keywords: Housing prices, Investor sentiment, Bayesian shrinkage, Time-varying parameter model
    JEL: C22 C32 C52 G12 R31
    Date: 2020–06
  9. By: Athey, Susan (Stanford U); Bryan, Kevin (U of Toronto); Gans, Joshua S. (U of Toronto)
    Abstract: The allocation of decision authority by a principal to either a human agent or an artificial intelligence (AI) is examined. The principal trades off an AI's more aligned choice with the need to motivate the human agent to expend effort in learning choice payoffs. When agent effort is desired, it is shown that the principal is more likely to give that agent decision authority, reduce investment in AI reliability and adopt an AI that may be biased. Organizational design considerations are likely to impact on how AIs are trained.
    Date: 2020–01
  10. By: Imanol Perez Arribas; Cristopher Salvi; Lukasz Szpruch
    Abstract: Mathematical models, calibrated to data, have become ubiquitous for making key decisions in modern quantitative finance. In this work, we propose a novel framework for data-driven model selection by integrating a classical quantitative setup with a generative modelling approach. Leveraging the properties of the signature, a well-known path-transform from stochastic analysis that recently emerged as a leading machine learning technology for learning time-series data, we develop the Sig-SDE model. Sig-SDE provides a new perspective on neural SDEs and can be calibrated to exotic financial products that depend, in a non-linear way, on the whole trajectory of asset prices. Furthermore, our approach enables consistent calibration under the pricing measure $\mathbb Q$ and the real-world measure $\mathbb P$. Finally, we demonstrate the ability of Sig-SDE to simulate future possible market scenarios needed for computing risk profiles or hedging strategies. Importantly, this new model is underpinned by rigorous mathematical analysis that, under appropriate conditions, provides theoretical guarantees for convergence of the presented algorithms.
    Date: 2020–05
  11. By: Gabriel E. Kreindler (Harvard University); Yuhei Miyauchi (Boston University)
    Abstract: We show how commuting flows can be used to infer the spatial distribution of income within a city. We use a simple workplace choice model, which predicts a gravity equation for commuting flows whose destination fixed effects correspond to wages. We implement this method with cell phone transaction data from Dhaka and Colombo. Model-predicted income predicts separate income data, at the workplace and residential level. Unlike machine learning approaches, our method does not require training data, yet achieves comparable predictive power. In an application, we show that hartals (transportation strikes) in Dhaka lower commuting, leading to 5-8% lower predicted income.
    JEL: C55 E24 R14
    Date: 2019–02
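    The gravity equation with destination fixed effects mentioned in the abstract can be illustrated on synthetic data. The sketch below is an assumption-laden toy, not the authors' estimator: it generates noise-free log commuting flows from made-up origin effects, destination effects, and a distance elasticity, then recovers the destination effects (which the abstract says correspond to wages) by ordinary least squares with dummy variables. All names and parameter values are hypothetical, and NumPy is assumed to be available.

```python
import numpy as np


def estimate_destination_effects(log_flow, log_dist):
    """OLS of log flows on origin dummies, destination dummies and log distance.

    Returns the destination effects relative to destination 0 and the
    coefficient on log distance. Within-zone pairs (i == j) are skipped.
    """
    n = log_flow.shape[0]
    rows, y = [], []
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            x = np.zeros(2 * n)
            x[i] = 1.0                    # origin dummy
            if j > 0:
                x[n + j - 1] = 1.0        # destination dummy (0 is baseline)
            x[2 * n - 1] = log_dist[i, j]  # log distance regressor
            rows.append(x)
            y.append(log_flow[i, j])
    coef, *_ = np.linalg.lstsq(np.array(rows), np.array(y), rcond=None)
    dest = np.concatenate(([0.0], coef[n:2 * n - 1]))
    return dest, coef[-1]


# Synthetic city: six zones at random locations (all values are made up).
rng = np.random.default_rng(0)
n_zones = 6
pos = rng.uniform(0.0, 10.0, size=(n_zones, 2))
dist = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=2)
np.fill_diagonal(dist, 1.0)  # placeholder; diagonal pairs are never used
log_dist = np.log(dist)
origin_fe = rng.normal(size=n_zones)
dest_fe = rng.normal(size=n_zones)
beta = 1.5  # assumed distance elasticity
log_flow = origin_fe[:, None] + dest_fe[None, :] - beta * log_dist

dest_hat, dist_coef = estimate_destination_effects(log_flow, log_dist)
```

    Destination effects are identified only up to a normalization, so the estimates are reported relative to the first zone.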
  12. By: Riccardo Doyle
    Abstract: Interbank contagion can theoretically exacerbate losses in a financial system and lead to additional cascade defaults during a downturn. In this paper we produce default analysis using both regression and neural network models to verify whether interbank contagion offers any predictive explanatory power on default events. We predict defaults of U.S.-domiciled commercial banks in the first quarter of 2010 using data from the preceding four quarters. A number of established predictors (such as Tier 1 Capital Ratio and Return on Equity) are included alongside contagion to gauge if the latter adds significance. Based on this methodology, we conclude that interbank contagion is extremely explanatory in default prediction, often outperforming more established metrics, in both regression and neural network models. These findings have sizeable implications for the future use of interbank contagion as a variable of interest for stress testing, bank-issued bond valuation and wider bank default prediction.
    Date: 2020–05
  13. By: Grilli, Luca; Santoro, Domenico
    Abstract: In this paper we want to demonstrate how it is possible to improve the forecast by using Boltzmann entropy like the classic financial indicators, through neural networks. In particular, we show how it is possible to increase the scope of entropy by moving from cryptocurrencies to equities and how this type of architecture highlights the link between the indicators and the information that they are able to contain.
    Keywords: Neural Network; Price Forecasting; LSTM; Entropy
    JEL: C45 E37 F17 G17
    Date: 2020–05–22
  14. By: PALETTA, FRANCISCO CARLOS; da Conceição, Rafael Sena
    Abstract: The advance of Information Technology has led to exponential growth in the volume of data. The challenge, given this universe, is to identify, throughout the strategic decision-making process, the data with real potential to generate value for the organization and then to turn that potential into competitive advantage. The methodology of this research is based on a literature review followed by a qualitative and quantitative study, conducted through a questionnaire administered to providers of intelligence information in the Public Treasury (Fazenda Pública). The results showed that, as a rule, expectations are aligned between the senior management of the public agency and the information providers. Strategic decision making and organizational results are strongly affected by the enormous quantity of data available in the digital era and by the capacity to organize and analyze it.
    Date: 2019–04–10
  15. By: Aurore Oskar KOWALEWSKI (IESEG School of Management & LEM-CNRS 9221); Paweł PISANY (Institute of Economics, Polish Academy of Sciences)
    Abstract: This study investigates the determinants of fintech company creation and activity using a cross-country sample that includes developed and developing countries. Using a random effect negative binomial model and explainable machine learning algorithms, we show the positive role of technology advancements in each economy, quality of research, and more importantly, the level of university-industry collaboration. Additionally, we find that demographic factors may play a role in fintech creation and activity. Some fintech companies may find the quality and stringency of regulation to be an obstacle. Our results also show the sophisticated interactions between the banking sector and fintech companies that we may describe as a mix of cooperation and competition.
    Keywords: fintech, innovation, start up, developed countries, developing countries
    JEL: G21 G23 L26 O30
    Date: 2020–07
  16. By: Dickinson, Jeffrey
    Abstract: This paper seeks to advance understanding of the lights-income relationship by linking the newest generation of night-time satellite images, the VIIRS images, to nationwide panel data on 3,101 US counties, including data on both population and income. I leverage the quality and frequency of those data sources and the VIIRS lights images to decompose the links between population growth, official GDP growth, and night-time lights growth at the county level. I use a between-county estimator to identify the effects of time-invariant infrastructure features on night-time light. I find roads, rail, ports, and airports to be strong contributors to increases in light. I find that GDP growth is weakly linked with night-time lights, though light growth is strongly linked with population growth even when controlling for the substantial non-linearities which appear to be present.
    Keywords: night-time light, GDP, population, infrastructure, regional development
    JEL: C82 O51 R10 R11 R12
    Date: 2020
  17. By: Calypso Herrera; Florian Krach; Josef Teichmann
    Abstract: Continuous stochastic processes are widely used to model time series that exhibit a random behaviour. Predictions of the stochastic process can be computed by the conditional expectation given the current information. To this end, we introduce the controlled ODE-RNN that provides a data-driven approach to learn the conditional expectation of a stochastic process. Our approach extends the ODE-RNN framework which models the latent state of a recurrent neural network (RNN) between two observations with a neural ordinary differential equation (neural ODE). We show that controlled ODEs provide a general framework which can in particular describe the ODE-RNN, combining in a single equation the continuous neural ODE part with the jumps introduced by the RNN. We demonstrate the predictive capabilities of this model by proving that, under some regularity assumptions, the output process converges to the conditional expectation process.
    Date: 2020–06
  18. By: Lars Elend; Sebastian A. Tideman; Kerstin Lopatta; Oliver Kramer
    Abstract: In the financial sector, a reliable forecast of the future financial performance of a company is of great importance for investors' investment decisions. In this paper we compare long short-term memory (LSTM) networks to temporal convolutional networks (TCNs) in the prediction of future earnings per share (EPS). The experimental analysis is based on quarterly financial reporting data and daily stock market returns. For a broad sample of US firms, we find that LSTMs outperform the naive persistent model with up to 30.0% more accurate predictions, while TCNs achieve an improvement of 30.8%. Both types of networks are at least as accurate as analysts and exceed them by up to 12.2% (LSTM) and 13.2% (TCN).
    Date: 2020–06
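    The comparison against a naive persistent model can be sketched as follows. This is a minimal illustration under stated assumptions: the abstract does not name the error metric, so mean absolute error is assumed here, and the EPS series and the model forecasts are made-up numbers, not results from the paper.

```python
def mae(actual, predicted):
    """Mean absolute error between two equally long sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def persistence_forecast(series):
    """Naive persistence: forecast each value as the previous observation.

    Returns forecasts aligned with series[1:].
    """
    return series[:-1]


def relative_improvement(actual, model_pred, baseline_pred):
    """Share by which the model's MAE falls below the baseline's MAE."""
    return 1.0 - mae(actual, model_pred) / mae(actual, baseline_pred)


# Hypothetical quarterly EPS series and hypothetical model forecasts.
eps = [1.00, 1.10, 1.05, 1.20, 1.30]
actual = eps[1:]
naive_pred = persistence_forecast(eps)
model_pred = [1.08, 1.07, 1.15, 1.27]

improvement = relative_improvement(actual, model_pred, naive_pred)
```

    A positive value means the model beats persistence; a figure such as "30% more accurate" would correspond to a value of 0.30 under this assumed definition.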
  19. By: Hassan, Tarek Alexander; Hollander, Stephan; Tahoun, Ahmed; van Lent, Laurence
    Abstract: Using tools described in our earlier work, Hassan et al. (2019, 2020), we develop text-based measures of the costs, benefits, and risks listed firms in the US and over 80 other countries associate with the spread of Covid-19 and other epidemic diseases. We identify which firms expect to gain or lose from an epidemic disease and which are most affected by the associated uncertainty as a disease spreads in a region or around the world. As Covid-19 spreads globally in the first quarter of 2020, we find that firms' primary concerns relate to the collapse of demand, increased uncertainty, and disruption in supply chains. Other important concerns relate to capacity reductions, closures, and employee welfare. By contrast, financing concerns are mentioned relatively rarely. We also identify some firms that foresee opportunities in new or disrupted markets due to the spread of the disease. Finally, we find some evidence that firms that have experience with SARS or H1N1 have more positive expectations about their ability to deal with the coronavirus outbreak.
    Keywords: Epidemic diseases; exposure; firms; Machine Learning; Pandemic; sentiment; uncertainty; virus
    JEL: D22 E0 F0 G15 I15 I18
    Date: 2020–04
  20. By: Emaad Manzoor; George H. Chen; Dokyun Lee; Michael D. Smith
    Abstract: Deliberation among individuals online plays a key role in shaping the opinions that drive votes, purchases, donations and other critical offline behavior. Yet, the determinants of opinion-change via persuasion in deliberation online remain largely unexplored. Our research examines the persuasive power of $\textit{ethos}$ -- an individual's "reputation" -- using a 7-year panel of over a million debates from an argumentation platform containing explicit indicators of successful persuasion. We identify the causal effect of reputation on persuasion by constructing an instrument for reputation from a measure of past debate competition, and by controlling for unstructured argument text using neural models of language in the double machine-learning framework. We find that an individual's reputation significantly impacts their persuasion rate above and beyond the validity, strength and presentation of their arguments. In our setting, we find that having 10 additional reputation points causes a 31% increase in the probability of successful persuasion over the platform average. We also find that the impact of reputation is moderated by characteristics of the argument content, in a manner consistent with a theoretical model that attributes the persuasive power of reputation to heuristic information-processing under cognitive overload. We discuss managerial implications for platforms that facilitate deliberative decision-making for public and private organizations online.
    Date: 2020–06
  21. By: Mihalyi, David; Mate, Akos
    Abstract: This article introduces an original panel dataset based on the text of country reports by the International Monetary Fund. It consists of a total of 5561 Article IV consultation and program review documents, published between 2004 and 2018 on 201 countries. The text of these reports provides indications of the perceived policy weaknesses, economic risks, ongoing reforms and implemented or neglected policy advice. Thus the content of IMF reports is widely used in the economics, political science and IR literature. To our knowledge this is the first comprehensive dataset that aggregates these country reports. The paper gives a detailed account of the data acquisition and management process. To demonstrate and validate the dataset’s application for research we present three validation exercises. We find that Article IV reports can indicate incoming institutional reforms, show changes in IMF policy advice over time and identify potential gains from recently discovered natural resources in certain cases. Taken together, this paper contributes an original dataset of IMF country reports and demonstrates how it can be a useful foundation for further research into the role of international financial institutions.
    Keywords: economic policy, IMF, text analysis, original dataset
    JEL: E60 F53
    Date: 2019–08–02

This nep-big issue is ©2020 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
For comments please write to the director of NEP, Marco Novarese. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.