nep-big New Economics Papers
on Big Data
Issue of 2019‒02‒25
nineteen papers chosen by
Tom Coupé
University of Canterbury

  1. Cross-validation based forecasting method: a machine learning approach By Pinto, Jeronymo Marcondes; Marçal, Emerson Fernandes
  2. Does Scientific Progress Affect Culture? A Digital Text Analysis By Michela Giorcelli; Nicola Lacetera; Astrid Marinoni
  3. Stacking with Neural network for Cryptocurrency investment By Avinash Barnwal; Haripad Bharti; Aasim Ali; Vishal Singh
  4. Artificial intelligence, jobs, inequality and productivity: Does aggregate demand matter? By Gries, Thomas; Naude, Wim
  5. Constrained Risk Budgeting Portfolios: Theory, Algorithms, Applications & Puzzles By Jean-Charles Richard; Thierry Roncalli
  6. Agent based model calibration using machine learning surrogates By Francesco Lamperti; Andrea Roventini; Amir Sani
  7. The Long-Run Information Effect of Central Bank Communication By Hansen, Stephen; McMahon, Michael; Tong, Matthew
  8. Modified Causal Forests for Estimating Heterogeneous Causal Effects By Lechner, Michael
  9. Folklore By Michalopoulos, Stelios; Xue, Melanie Meng
  10. The use of prior information in very robust regression for fraud detection By Riani, Marco; Corbellini, Aldo; Atkinson, Anthony C.
  11. Macroeconomic time-series evidence that energy efficiency improvements do not save energy By Stephan B. Bruns; Alessio Moneta; David I. Stern
  12. Big data analytics and organizational culture as complements to swift trust and collaborative performance in the humanitarian supply chain By Rameshwar Dubey; Angappa Gunasekaran; Stephen Childe; David Roubaud; Samuel Fosso Wamba; Mihalis Giannakis; Cyril Foropon
  13. Statistical profiling in public employment services: An international comparison By Sam Desiere; Kristine Langenbucher; Ludo Struyven
  14. Supervised Deep Neural Networks (DNNs) for Pricing/Calibration of Vanilla/Exotic Options Under Various Different Processes By Ali Hirsa; Tugce Karatas; Amir Oskoui
  15. Let There Be Light: Trade and the Development of Border Regions By Brülhart, Marius; Cadot, Olivier; Himbert, Alexander
  16. Deep Adaptive Input Normalization for Price Forecasting using Limit Order Book Data By Nikolaos Passalis; Anastasios Tefas; Juho Kanniainen; Moncef Gabbouj; Alexandros Iosifidis
  17. Natural Disasters and Regional Development - The Case of Earthquakes By Marius Fabian; Christian Leßmann; Tim Sofke
  18. Bringing Satellite-Based Air Quality Estimates Down to Earth By Meredith Fowlie; Edward A. Rubin; Reed Walker
  19. Measuring the size and growth of cities using nighttime light By Ch, R.; Martin, D.; Vargas, J.

  1. By: Pinto, Jeronymo Marcondes; Marçal, Emerson Fernandes
    Abstract: Our paper aims to evaluate two novel methods for selecting the best forecasting model, or combination of models, based on a machine learning approach. Both methods select the "best" model, or combination of models, from a set of candidate models by cross-validation. The first is based on the seminal paper of Granger-Bates (1969), but the weights are estimated by cross-validation applied to the training set. The second, which we call CvML (Cross-Validation Machine Learning Method), selects the model with the best forecasting performance in the process described above. The following models are used: exponential smoothing, SARIMA, artificial neural networks and threshold autoregression (TAR). Model specification is chosen by the R packages forecast and TSA. Both methods, CvML and MGB, are applied to these models to generate forecasts from one up to twelve periods ahead. The data are monthly. We run the forecasting exercise on monthly series of Industrial Product Indices for seven countries: Canada, Brazil, Belgium, Germany, Portugal, UK and USA. The data were collected from the OECD, with 504 observations. We choose Average Forecast Combination, Granger Bates Method, MCS model, Naive and Seasonal Naive Model as benchmarks. Our results suggest that MGB did not perform well. However, CvML had a lower mean absolute error for most countries and forecast horizons, particularly at longer horizons, surpassing all the proposed benchmarks. Similar results hold for absolute mean forecast error.
    Date: 2019–02
    URL: http://d.repec.org/n?u=RePEc:fgv:eesptd:498&r=all
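The two ideas in item 1, picking one model by cross-validated error and weighting models by their cross-validated accuracy, can be illustrated with a minimal sketch. This is not the authors' code: the three toy forecasters, the synthetic series, the expanding-window scheme, and the simplified inverse-MSE weights (in the spirit of Bates and Granger, ignoring error correlations) are all assumptions made for illustration.

```python
import numpy as np

def naive(train, h):
    # Repeat the last observed value h steps ahead.
    return np.repeat(train[-1], h)

def seasonal_naive(train, h, period=12):
    # Repeat the last observed seasonal cycle.
    last_cycle = train[-period:]
    return np.array([last_cycle[i % period] for i in range(h)])

def mean_forecast(train, h):
    # Forecast the historical mean at every horizon.
    return np.repeat(train.mean(), h)

# Hypothetical candidate set; the paper uses richer models (SARIMA, ANN, TAR).
MODELS = {"naive": naive, "seasonal_naive": seasonal_naive, "mean": mean_forecast}

def rolling_origin_errors(series, h=1, min_train=36):
    """Collect h-step-ahead errors for each model over expanding training windows."""
    errors = {name: [] for name in MODELS}
    for end in range(min_train, len(series) - h + 1):
        train, actual = series[:end], series[end + h - 1]
        for name, model in MODELS.items():
            errors[name].append(actual - model(train, h)[-1])
    return {name: np.array(e) for name, e in errors.items()}

def select_best(errors):
    # Choose the model with the lowest cross-validated mean absolute error.
    return min(errors, key=lambda name: np.mean(np.abs(errors[name])))

def inverse_mse_weights(errors):
    # Combination weights proportional to 1 / MSE, normalised to sum to one.
    inv = {name: 1.0 / np.mean(e ** 2) for name, e in errors.items()}
    total = sum(inv.values())
    return {name: v / total for name, v in inv.items()}

# Synthetic monthly series with a strong seasonal pattern (assumption).
rng = np.random.default_rng(0)
t = np.arange(120)
series = 10 + 3 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 0.2, size=120)

errs = rolling_origin_errors(series, h=1)
best = select_best(errs)
weights = inverse_mse_weights(errs)
```

On a strongly seasonal series like this one, the seasonal naive forecaster should be selected and receive the largest combination weight.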
  2. By: Michela Giorcelli; Nicola Lacetera; Astrid Marinoni
    Abstract: We study the interplay between scientific progress and culture through text analysis on a corpus of about eight million books, with the use of techniques and algorithms from machine learning. We focus on a specific scientific breakthrough, the theory of evolution through natural selection by Charles Darwin, and examine the diffusion of certain key concepts that characterized this theory in the broader cultural discourse and social imaginary. We find that some concepts in Darwin’s theory, such as Evolution, Survival, Natural Selection and Competition, diffused in the cultural discourse immediately after the publication of On the Origin of Species. Other concepts such as Selection and Adaptation were already present in the cultural dialogue. Moreover, we document semantic changes for most of these concepts over time. Our findings thus show a complex relation between two key factors of long-term economic growth – science and culture. Considering the evolution of these two factors jointly can offer new insights to the study of the determinants of economic development, and machine learning is a promising tool to explore these relationships.
    Keywords: science, culture, economic history, text analysis, machine learning
    JEL: C19 C89 N00 O00 O39 Z19
    Date: 2019
    URL: http://d.repec.org/n?u=RePEc:ces:ceswps:_7499&r=all
  3. By: Avinash Barnwal; Haripad Bharti; Aasim Ali; Vishal Singh
    Abstract: Predicting the direction of asset prices has been an active area of study and a difficult task. Machine learning models have been used to build robust models for this task. Ensemble methods are among them, showing better results than a single supervised method. In this paper, we use generative and discriminative classifiers to create the stack, specifically 3 generative and 9 discriminative classifiers, optimized over a one-layer neural network to model the direction of cryptocurrency prices. The features used are technical indicators, including but not limited to trend, momentum, volume and volatility indicators; sentiment analysis is also combined with these features to gain useful insight. For cross-validation, purged walk-forward cross-validation is used. In terms of accuracy, we present a comparative analysis of the performance of the ensemble method with stacking and the ensemble method with blending. We have also developed a methodology for combined feature importance for the stacked model. Important indicators are also identified based on feature importance.
    Date: 2019–02
    URL: http://d.repec.org/n?u=RePEc:arx:papers:1902.07855&r=all
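The purged walk-forward cross-validation mentioned in item 3 can be sketched as an index splitter. The abstract gives no details, so the fold layout, the function name `purged_walk_forward`, and the `purge` length below are hypothetical. The sketch only shows the general idea: each test fold comes after its training window, and a few observations just before the test fold are dropped so that overlapping labels do not leak into training.

```python
def purged_walk_forward(n_samples, n_splits=5, purge=5):
    """Yield (train_indices, test_indices) pairs in time order.

    Each test fold follows its training window; the last `purge`
    observations before the test fold are excluded from training.
    """
    fold = n_samples // (n_splits + 1)
    if fold <= purge:
        raise ValueError("purge must be shorter than one fold")
    for k in range(1, n_splits + 1):
        test_start = k * fold
        # The final fold absorbs any leftover observations.
        test_end = n_samples if k == n_splits else test_start + fold
        train_end = test_start - purge
        yield list(range(0, train_end)), list(range(test_start, test_end))

# Hypothetical usage: 100 time-ordered samples, 4 folds, 3 purged observations.
splits = list(purged_walk_forward(n_samples=100, n_splits=4, purge=3))
```

The training window here expands with each fold. A fixed-length rolling window would be an equally plausible reading of the abstract.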
  4. By: Gries, Thomas (Universitat Paderborn); Naude, Wim (UNU-MERIT, and Maastricht University, RWTH Aachen University, IZA Institute of Labor Economics, Bonn.)
    Abstract: Rapid technological progress in artificial intelligence (AI) has been predicted to lead to mass unemployment, rising inequality, and higher productivity growth through automation. In this paper we critically re-assess these predictions by (i) surveying the recent literature and (ii) incorporating AI-facilitated automation into a product-variety model, frequently used in endogenous growth theory, but modified to allow for demand-side constraints. This is a novel approach, given that endogenous growth models, including most recent work on AI in economic growth, are largely supply-driven. Our contribution is motivated by two reasons. One is that there are still only very few theoretical models of economic growth that incorporate AI, and moreover an absence of growth models with AI that take into consideration growth constraints due to insufficient aggregate demand. A second is that the predictions of AI causing massive job losses and faster growth in productivity and GDP are at odds with reality so far: if anything, unemployment in many advanced economies is historically low. However, wage growth and productivity are stagnating and inequality is rising. Our paper provides a theoretical explanation of this in the context of rapid progress in AI.
    Keywords: Technology, artificial intelligence, productivity, labour demand, innovation, growth theory
    JEL: O47 O33 J24 E21 E25
    Date: 2018–12–12
    URL: http://d.repec.org/n?u=RePEc:unm:unumer:2018047&r=all
  5. By: Jean-Charles Richard; Thierry Roncalli
    Abstract: This article develops the theory of risk budgeting portfolios, when we would like to impose weight constraints. It appears that the mathematical problem is more complex than the traditional risk budgeting problem. The formulation of the optimization program is particularly critical in order to determine the right risk budgeting portfolio. We also show that numerical solutions can be found using methods that are used in large-scale machine learning problems. Indeed, we develop an algorithm that mixes the method of cyclical coordinate descent (CCD), alternating direction method of multipliers (ADMM), proximal operators and Dykstra's algorithm. This theoretical body is then applied to some investment problems. In particular, we show how to dynamically control the turnover of a risk parity portfolio and how to build smart beta portfolios based on the ERC approach by improving the liquidity of the portfolio or reducing the small cap bias. Finally, we highlight the importance of the homogeneity property of risk measures and discuss the related scaling puzzle.
    Date: 2019–02
    URL: http://d.repec.org/n?u=RePEc:arx:papers:1902.05710&r=all
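As a rough illustration of the cyclical coordinate descent (CCD) ingredient named in item 5, the sketch below solves the plain risk budgeting problem without weight constraints, using volatility as the risk measure. It minimises 0.5 x'Σx − λ Σ b_i ln x_i one coordinate at a time and then rescales the solution to sum to one. The constrained algorithm in the paper also involves ADMM, proximal operators and Dykstra's algorithm, none of which appear here. The covariance matrix, the budgets and the function names are assumptions.

```python
import numpy as np

def risk_budgeting_ccd(cov, budgets, lam=1.0, n_iter=500, tol=1e-12):
    """Unconstrained risk budgeting weights via cyclical coordinate descent."""
    n = len(budgets)
    x = 1.0 / np.sqrt(np.diag(cov))  # start from inverse volatilities
    for _ in range(n_iter):
        x_prev = x.copy()
        for i in range(n):
            # Covariance of asset i with the rest of the current portfolio.
            c = cov[i] @ x - cov[i, i] * x[i]
            # Positive root of cov_ii * x_i^2 + c * x_i - lam * b_i = 0.
            x[i] = (-c + np.sqrt(c * c + 4.0 * cov[i, i] * lam * budgets[i])) / (2.0 * cov[i, i])
        if np.max(np.abs(x - x_prev)) < tol:
            break
    return x / x.sum()  # rescale so the weights sum to one

def risk_contributions(weights, cov):
    # Contribution of each asset to portfolio volatility.
    sigma = np.sqrt(weights @ cov @ weights)
    return weights * (cov @ weights) / sigma

# Hypothetical three-asset example.
vol = np.array([0.10, 0.20, 0.30])
corr = np.array([[1.0, 0.5, 0.3],
                 [0.5, 1.0, 0.2],
                 [0.3, 0.2, 1.0]])
cov = np.outer(vol, vol) * corr
budgets = np.array([0.5, 0.3, 0.2])

weights = risk_budgeting_ccd(cov, budgets)
rc = risk_contributions(weights, cov)
rc_share = rc / rc.sum()
```

At the solution, each asset's share of total risk should match its budget. Because volatility is homogeneous of degree one, the final rescaling does not change those shares.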
  6. By: Francesco Lamperti (Université Panthéon-Sorbonne - Paris 1 (UP1)); Andrea Roventini (Observatoire français des conjonctures économiques); Amir Sani (Université Panthéon-Sorbonne X)
    Abstract: Efficiently calibrating agent-based models (ABMs) to real data is an open challenge. This paper explicitly tackles parameter space exploration and calibration of ABMs by combining machine learning and intelligent iterative sampling. The proposed approach “learns” a fast surrogate meta-model using a limited number of ABM evaluations and approximates the nonlinear relationship between ABM inputs (initial conditions and parameters) and outputs. Performance is evaluated on the Brock and Hommes (1998) asset pricing model and the “Islands” endogenous growth model of Fagiolo and Dosi (2003). Results demonstrate that machine learning surrogates obtained using the proposed iterative learning procedure provide a quite accurate proxy of the true model and dramatically reduce the computation time necessary for large-scale parameter space exploration and calibration.
    Keywords: Agent based model; Calibration; Machine learning; Surrogate; Meta-model
    JEL: C15 C52 C63
    Date: 2018–05
    URL: http://d.repec.org/n?u=RePEc:spo:wpmain:info:hdl:2441/13thfd12aa8rmplfudlgvgahff&r=all
  7. By: Hansen, Stephen; McMahon, Michael; Tong, Matthew
    Abstract: Why do long-run interest rates respond to central bank communication? Whereas existing explanations imply a common set of signals drives short and long-run yields, we show that news on economic uncertainty can have increasingly large effects along the yield curve. To evaluate this channel, we use the publication of the Bank of England's Inflation Report, from which we measure a set of high-dimensional signals. The signals that drive long-run interest rates do not affect short-run rates and operate primarily through the term premium. This suggests communication plays an important role in shaping perceptions of long-run uncertainty.
    Keywords: communication; Machine Learning; monetary policy
    JEL: E52 E58
    Date: 2019–01
    URL: http://d.repec.org/n?u=RePEc:cpr:ceprdp:13438&r=all
  8. By: Lechner, Michael
    Abstract: Uncovering the heterogeneity of causal effects of policies and business decisions at various levels of granularity provides substantial value to decision makers. This paper develops new estimation and inference procedures for multiple treatment models in a selection-on-observables framework by modifying the Causal Forest approach suggested by Wager and Athey (2018). The new estimators have desirable theoretical and computational properties for various aggregation levels of the causal effects. An Empirical Monte Carlo study shows that they may outperform previously suggested estimators. Inference tends to be accurate for effects relating to larger groups and conservative for effects relating to fine levels of granularity. An application to the evaluation of an active labour market programme shows the value of the new methods for applied research.
    Keywords: average treatment effects; causal forests; Causal machine learning; conditional average treatment effects; multiple treatments; selection-on-observable; statistical learning
    JEL: C21 J68
    Date: 2019–01
    URL: http://d.repec.org/n?u=RePEc:cpr:ceprdp:13430&r=all
  9. By: Michalopoulos, Stelios; Xue, Melanie Meng
    Abstract: Folklore is the collection of traditional beliefs, customs, and stories of a community, passed through the generations by word of mouth. This vast expressive body, studied by the corresponding discipline of folklore, has evaded the attention of economists. In this study we do four things that reveal the tremendous potential of this corpus for understanding comparative development and culture. First, we introduce and describe a unique catalogue of folklore that codes the presence of thousands of motifs for roughly 1,000 pre-industrial societies. Second, we use a dictionary-based approach to elicit group-specific measures of various traits related to the natural environment, institutional framework, and mode of subsistence. We establish that these proxies are in accordance with the ethnographic record, and illustrate how to use a group's oral tradition to quantify non-extant characteristics of pre-industrial societies. Third, we use folklore to uncover the historical cultural values of a group. Doing so allows us to test various influential conjectures among social scientists, including the original affluent society, the culture of honor among pastoralists, the role of family in extended kinship systems and the intensity of trade and rule-following norms in politically centralized groups. Finally, we explore how cultural norms inferred via text analysis of oral traditions predict contemporary attitudes and beliefs.
    Keywords: Culture; Development; Folklore; History; Values
    JEL: O10 Z1 Z10 Z13
    Date: 2019–01
    URL: http://d.repec.org/n?u=RePEc:cpr:ceprdp:13425&r=all
  10. By: Riani, Marco; Corbellini, Aldo; Atkinson, Anthony C.
    Abstract: Misinvoicing is a major tool in fraud including money laundering. We develop a method of detecting the patterns of outliers that indicate systematic mis‐pricing. As the data only become available year by year, we develop a combination of very robust regression and the use of ‘cleaned’ prior information from earlier years, which leads to early and sharp indication of potentially fraudulent activity that can be passed to legal agencies to institute prosecution. As an example, we use yearly imports of a specific seafood into the European Union. This is only one of over one million annual data sets, each of which can currently potentially contain 336 observations. We provide a solution to the resulting big data problem, which requires analysis with the minimum of human intervention.
    Keywords: big data; data cleaning; forward search; MM estimation; misinvoicing; money laundering; seafood; timeliness
    JEL: C1 F3 G3
    Date: 2018–08–01
    URL: http://d.repec.org/n?u=RePEc:ehl:lserod:87685&r=all
  11. By: Stephan B. Bruns; Alessio Moneta; David I. Stern
    Abstract: The size of the economy-wide rebound effect is crucial for estimating the contribution that energy efficiency improvements can make to reducing energy use and greenhouse gas emissions. We provide the first empirical general equilibrium estimate of the economy-wide rebound effect. We use a structural vector autoregressive (SVAR) model that is estimated using search methods developed in machine learning. We apply the SVAR to U.S. monthly and quarterly data, finding that after four years rebound is around 100%. This implies that policies to encourage cost-reducing energy efficiency innovation are not likely to significantly reduce energy use and greenhouse gas emissions.
    JEL: C32 Q43
    Date: 2019–02
    URL: http://d.repec.org/n?u=RePEc:een:camaaa:2019-21&r=all
  12. By: Rameshwar Dubey (Montpellier Business School (MBS), Montpellier Research in Management (MRM)); Angappa Gunasekaran (School of Business and Public Administration, California State University); Stephen Childe (University of Plymouth Business School); David Roubaud (Montpellier Business School (MBS), Montpellier Research in Management (MRM)); Samuel Fosso Wamba (Toulouse Business School); Mihalis Giannakis (Audencia Business School); Cyril Foropon (Montpellier Business School (MBS), Montpellier Research in Management (MRM))
    Date: 2019–04
    URL: http://d.repec.org/n?u=RePEc:hal:journl:hal-01996486&r=all
  13. By: Sam Desiere; Kristine Langenbucher; Ludo Struyven
    Abstract: Profiling tools help to deliver employment services more efficiently. They can ensure that more costly, intensive services are targeted at jobseekers most at risk of becoming long-term unemployed. Moreover, the detailed information on the employment barriers facing jobseekers obtained through the profiling process can be used to tailor services more closely to their individual needs. While other forms of profiling exist, the focus is on statistical profiling, which makes use of statistical models to predict jobseekers’ likelihood of becoming long-term unemployed. An overview of profiling tools currently used throughout the OECD is presented, along with considerations for the development of such tools and some insights into the latest developments such as using “click data” on job searches and advanced machine learning techniques. Also discussed are the limitations of statistical profiling tools and options for policymakers on how to address those in the development and implementation of statistical profiling tools.
    Keywords: active labour market policy, caseworkers, employment barrier, jobseekers, selection, statistical profiling, targeting, unemployment
    JEL: J64 J68
    Date: 2019–02–18
    URL: http://d.repec.org/n?u=RePEc:oec:elsaab:224-en&r=all
  14. By: Ali Hirsa; Tugce Karatas; Amir Oskoui
    Abstract: We apply supervised deep neural networks (DNNs) for pricing and calibration of both vanilla and exotic options under both diffusion and pure jump processes with and without stochastic volatility. We train our neural network models with different numbers of layers, neurons per layer, and various activation functions in order to find which combinations work better empirically. For training, we consider various loss functions and optimization routines. We demonstrate that deep neural networks exponentially expedite option pricing compared to commonly used option pricing methods, which consequently makes calibration and parameter estimation very fast.
    Date: 2019–02
    URL: http://d.repec.org/n?u=RePEc:arx:papers:1902.05810&r=all
  15. By: Brülhart, Marius; Cadot, Olivier; Himbert, Alexander
    Abstract: Does international trade help or hinder the economic development of border regions relative to interior regions? Theory tends to suggest that trade helps, but it can also predict the reverse. The question is policy relevant as regions near land borders are generally poorer, and sometimes more prone to civil conflict, than interior regions. We therefore estimate how changes in bilateral trade volumes affect economic activity along roads running inland from international borders, using satellite night-light measurements for 2,186 border-crossing roads in 138 countries. We observe a significant 'border shadow': on average, lights are 37 percent dimmer at the border than 200 kilometers inland. We find this difference to be reduced by trade expansion as measured by exports and instrumented with tariffs on the opposite side of the border. At the mean, a doubling of exports to a particular neighbor country reduces the gradient of light from the border by some 23 percent. This qualitative finding applies to developed and developing countries, and to rural and urban border regions. Proximity to cities on either side of the border amplifies the effects of trade. We provide evidence that local export-oriented production is a significant mechanism behind the observed effects.
    Keywords: border regions; Economic Geography; night lights data; trade liberalization
    JEL: F14 F15 R11 R12
    Date: 2019–02
    URL: http://d.repec.org/n?u=RePEc:cpr:ceprdp:13515&r=all
  16. By: Nikolaos Passalis; Anastasios Tefas; Juho Kanniainen; Moncef Gabbouj; Alexandros Iosifidis
    Abstract: Deep Learning (DL) models can be used to tackle time series analysis tasks with great success. However, the performance of DL models can degenerate rapidly if the data are not appropriately normalized. This issue is even more apparent when DL is used for financial time series forecasting tasks, where the non-stationary and multimodal nature of the data poses significant challenges and severely affects the performance of DL models. In this work, a simple yet effective neural layer that is capable of adaptively normalizing the input time series, while taking into account the distribution of the data, is proposed. The proposed layer is trained in an end-to-end fashion using back-propagation and can lead to significant performance improvements. The effectiveness of the proposed method is demonstrated using a large-scale limit order book dataset.
    Date: 2019–02
    URL: http://d.repec.org/n?u=RePEc:arx:papers:1902.07892&r=all
  17. By: Marius Fabian; Christian Leßmann; Tim Sofke
    Abstract: We analyze the impact of earthquakes on nighttime lights at a sub-national level, i.e. on grids of different sizes. We argue that existing studies on the impact of natural disasters on economic development have several important limitations, both at the level of the outcome variable – usually national income or growth – and at the level of the independent variable, e.g. the timing of an event and the measurement of its intensity. We aim to overcome these limitations by using geophysical event data on earthquakes together with satellite nighttime lights. Using panel fixed effects regressions covering the entire world for the period 1992-2013, we find that earthquakes reduce both light growth rates and light levels significantly. The effects are persistent for approximately 5 years, but we find no long-run effects. The effects are strong and robust in a small grid and get weaker the larger the unit of observation. National institutions and economic conditions are relevant mediating factors.
    Keywords: natural disasters, earthquakes, event data, satellite nighttime lights, luminosity, grid data, institutions, growth, development
    JEL: O44 Q54
    Date: 2019
    URL: http://d.repec.org/n?u=RePEc:ces:ceswps:_7511&r=all
  18. By: Meredith Fowlie; Edward A. Rubin; Reed Walker
    Abstract: We use state-of-the-art, satellite-based PM2.5 estimates to assess the extent to which the EPA's existing, monitor-based measurements over- or under-estimate true exposure to PM2.5 pollution. Treating satellite-based estimates as truth implies a substantial number of "policy errors"—over-regulating areas that comply with air quality standards and under-regulating other areas that appear to violate standards. We investigate the health implications of these apparent errors and highlight the importance of accounting for prediction error in satellite-based estimates. Uncertainty in "policy errors" increases substantially when we account for these underlying prediction errors.
    JEL: H23 H41 Q50 Q52 Q53
    Date: 2019–02
    URL: http://d.repec.org/n?u=RePEc:nbr:nberwo:25560&r=all
  19. By: Ch, R.; Martin, D.; Vargas, J.
    Abstract: This paper uses high-resolution images of nighttime luminosity to estimate a globally comparable measure of the size of metropolitan areas around the world for the years 2000 and 2010. We apply recently proposed methodologies that correct the known problems of available nighttime luminosity data, including blurring, instability of lit pixels over time and the reduced comparability of night light images across satellites and across time. We then develop a protocol that isolates stable nighttime light pixels that constitute urban footprint, including low-luminosity urban settlements such as slums, and excluding confounding phenomena such as highway illumination. When analyzed together with existing geo-referenced population datasets, our measure of urban footprint can be used to compute city densities for the entire world. After characterizing some basic stylized facts regarding the distribution of urban sprawl, urban population and population density across world regions, we offer an application of our measure to the study of the size distribution of cities, including tests of Zipf's Law and Gibrat's Law.
    Keywords: Cities, Development, Urban development, Economics, Public policy
    Date: 2018
    URL: http://d.repec.org/n?u=RePEc:dbl:dblwop:1279&r=all
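The Zipf's Law test mentioned in item 19 is often run as a rank-size regression: regress log rank on log city size and check whether the slope is close to -1. The abstract does not say which estimator the authors use, so the sketch below is a generic illustration with made-up city sizes. The optional `shift` argument is an assumption that allows the common small-sample correction of subtracting 1/2 from each rank.

```python
import numpy as np

def rank_size_slope(sizes, shift=0.0):
    """OLS slope of log(rank - shift) on log(size); Zipf's Law implies about -1."""
    sizes = np.sort(np.asarray(sizes, dtype=float))[::-1]  # largest city first
    ranks = np.arange(1, len(sizes) + 1)
    slope, _intercept = np.polyfit(np.log(sizes), np.log(ranks - shift), 1)
    return slope

# Hypothetical sizes that follow Zipf's Law exactly: size = C / rank.
sizes = 1_000_000 / np.arange(1, 201)
slope = rank_size_slope(sizes)
```

With real urban footprint data the slope would be estimated with noise, so it should be reported together with a standard error that suits rank-size regressions.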

This nep-big issue is ©2019 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at http://nep.repec.org. For comments please write to the director of NEP, Marco Novarese at <director@nep.repec.org>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.