nep-big New Economics Papers
on Big Data
Issue of 2019‒03‒11
thirteen papers chosen by
Tom Coupé
University of Canterbury

  1. Using Artificial Intelligence to Recapture Norms: Did #metoo change gender norms in Sweden? By Sara Moricz
  2. Artificial Intelligence: The Ambiguous Labor Market Impact of Automating Prediction By Ajay Agrawal; Joshua S. Gans; Avi Goldfarb
  3. Metrics for Measuring the Performance of Machine Learning Prediction Models: An Application to the Housing Market By Miriam Steurer; Robert Hill
  4. Liquidity Management of Canadian Corporate Bond Mutual Funds: A Machine Learning Approach By Rohan Arora; Chen Fan; Guillaume Ouellet Leblanc
  5. Syria in the Dark: Estimating the Economic Consequences of the Civil War through Satellite-Derived Night Time Lights By Giorgia Giovannetti; Elena Perra
  6. Forecasting Economics and Financial Time Series: ARIMA vs. LSTM By Sima Siami-Namini; Akbar Siami Namin
  7. Big Data et pratiques de GRH By Clotilde Coron
  8. Identifying Bid Leakage In Procurement Auctions: Machine Learning Approach By Dmitry I. Ivanov; Alexander S. Nesterov
  9. Conditional Density Estimation with Neural Networks: Best Practices and Benchmarks By Jonas Rothfuss; Fabio Ferreira; Simon Walther; Maxim Ulrich
  10. Narratives About Technology-Induced Job Degradation Then and Now By Robert J. Shiller
  11. Model Selection in Utility-Maximizing Binary Prediction By Jiun-Hua Su
  12. Gaussian Process Regression for Pricing Variable Annuities with Stochastic Volatility and Interest Rate By Ludovic Goudenège; Andrea Molent; Antonino Zanette
  13. On binscatter By Cattaneo, Matias D.; Crump, Richard K.; Farrell, Max H.; Feng, Yingjie

  1. By: Sara Moricz
    Abstract: Norms are challenging to define and measure, but this paper takes advantage of text data and recent developments in machine learning to create an encompassing measure of norms. An LSTM neural network is trained to detect gendered language. The network functions as a tool to create a measure of how gender norms change in relation to the #MeToo movement on Swedish Twitter. This paper shows that gender norms on average are less salient half a year after the date of the first appearance of the hashtag #MeToo. Previous literature suggests that gender norms change over generations, but the current result suggests that norms can change in the short run.
    Date: 2019–03
  2. By: Ajay Agrawal; Joshua S. Gans; Avi Goldfarb
    Abstract: Recent advances in artificial intelligence are primarily driven by machine learning, a prediction technology. Prediction is useful because it is an input into decision-making. In order to appreciate the impact of artificial intelligence on jobs, it is important to understand the relative roles of prediction and decision tasks. We describe and provide examples of how artificial intelligence will affect labor, emphasizing differences between when automating prediction leads to automating decisions versus enhancing decision-making by humans.
    JEL: J20 O33
    Date: 2019–02
  3. By: Miriam Steurer (University of Graz, Austria); Robert Hill (University of Graz, Austria)
    Abstract: With the rapid growth of machine learning (ML) methods and datasets to which they can be applied, the question of how one can compare the predictive performance of competing models is becoming an issue of high importance. The existing literature is interdisciplinary, making it hard for users to locate and evaluate the set of available metrics. In this article we collect a number of such metrics from various sources. We classify them by type and then evaluate them with respect to two novel symmetry conditions. While none of these metrics satisfies both conditions, we propose a number of new metrics that do. In total we consider a portfolio of 56 performance metrics. To illustrate the problem of choosing between them, we provide an application in which five ML methods are used to predict apartment prices. We show that the most popular metrics for evaluating performance in the automated valuation model (AVM) literature generate misleading results. A different picture emerges when the full set of metrics is considered, and especially when we focus on the class of metrics with the best symmetry properties. We conclude by recommending four key metrics for evaluating model predictive performance.
    Keywords: Machine learning; Performance metric; Prediction error; Automated valuation model
    JEL: C45 C53
    Date: 2019–02
  4. By: Rohan Arora; Chen Fan; Guillaume Ouellet Leblanc
    Abstract: How do Canadian corporate bond mutual funds meet investor redemptions? We revisit this question using decision tree and random forest algorithms. We uncover new patterns in the decisions made by fund managers: the interaction between a larger, market-wide term spread and relatively less-liquid holdings increases the probability that a fund manager will sell less-liquid assets (corporate bonds) to meet redemptions. The evidence also shows that machine learning algorithms can extract new knowledge that is not apparent using a classical linear modelling approach.
    Keywords: Financial markets; Financial stability
    JEL: G1 G20 G23
    Date: 2019
  5. By: Giorgia Giovannetti; Elena Perra
    Abstract: The Syrian Civil War began in 2011 and is still wreaking enormous damage on the country's economy, with a heavy toll measured in deaths, migration, and the destruction of Syria's historical heritage and physical infrastructure. This paper examines the impact of the war on Syria's economy from the perspective of outer space, to bypass the problem of data availability caused by the inaccessibility of the war-ravaged territory. The estimates obtained in this way are more pessimistic than the ones reported by international organisations. Starting from our estimates, we provide long-term projections for the country's economy, and estimate the window for GDP to recover to pre-war levels. We discuss geo-political implications which could prevent our projections from materializing.
    Keywords: Syria, War, GDP estimates, Night-Lights
    JEL: E01 O15 C82 H56
    Date: 2019
  6. By: Sima Siami-Namini; Akbar Siami Namin
    Abstract: Forecasting time series data is an important subject in economics, business, and finance. Several traditional techniques forecast the next lag of time series data effectively, such as univariate Autoregressive (AR), univariate Moving Average (MA), Simple Exponential Smoothing (SES), and, more notably, Autoregressive Integrated Moving Average (ARIMA) with its many variations. In particular, the ARIMA model has demonstrated strong precision and accuracy in predicting the next lags of time series. With recent advances in computational power and, more importantly, the development of more advanced machine learning approaches such as deep learning, new algorithms have been developed to forecast time series data. The research question investigated in this article is whether and how the newly developed deep learning-based algorithms for forecasting time series data, such as "Long Short-Term Memory (LSTM)", are superior to the traditional algorithms. The empirical studies conducted and reported in this article show that deep learning-based algorithms such as LSTM outperform traditional algorithms such as the ARIMA model. More specifically, the average reduction in error rates obtained by LSTM is between 84 and 87 percent when compared to ARIMA, indicating the superiority of LSTM to ARIMA. Furthermore, the number of training passes, known as "epochs" in deep learning, was found to have no effect on the performance of the trained forecast model, which exhibits truly random behavior.
    Date: 2018–03
  7. By: Clotilde Coron (GREGOR - Groupe de Recherche en Gestion des Organisations - UP1 - Université Panthéon-Sorbonne - IAE Paris - Sorbonne Business School)
    Abstract: Big Data is a phenomenon that now permeates many fields: marketing, biology, justice… The common definition of Big Data, drawn from Gartner's 2001 report, rests essentially on the characteristics of the data involved: volume, heterogeneity of sources and of the degree of structuring, real-time updating of the data… This definition may seem restrictive, but other, more recent and more encompassing definitions make it possible to identify several systems that introduce Big Data into HR. Drawing on the notions of HRM systems and HRM practices, and focusing on three systems that use data in HR, we seek to characterize the changes in HRM practices that HR Big Data systems aim to bring about, centered on personalization and prediction.
    Keywords: Big Data, HR, HRM systems, HRM practices
    Date: 2019
  8. By: Dmitry I. Ivanov; Alexander S. Nesterov
    Abstract: We propose a novel machine-learning-based approach to detect bid leakage in first-price sealed-bid auctions. We extract and analyze the data on more than 1.4 million Russian procurement auctions between 2014 and 2018. As bid leakage in each particular auction is tacit, direct classification is impossible. Instead, we reduce the problem of bid leakage detection to Positive-Unlabeled Classification. The key idea is to regard the losing participants as fair and the winners as possibly corrupted. This allows us to estimate the prior probability of bid leakage in the sample, as well as the posterior probability of bid leakage for each specific auction. We find that at least 16% of auctions are exposed to bid leakage. Bid leakage is more likely in auctions with a higher reserve price, a lower number of bidders and a lower price fall, and where the winning bid is received in the last hour before the deadline.
    Date: 2019–03
  9. By: Jonas Rothfuss; Fabio Ferreira; Simon Walther; Maxim Ulrich
    Abstract: Given a set of empirical observations, conditional density estimation aims to capture the statistical relationship between a conditional variable $\mathbf{x}$ and a dependent variable $\mathbf{y}$ by modeling their conditional probability $p(\mathbf{y}|\mathbf{x})$. The paper develops best practices for conditional density estimation for finance applications with neural networks, grounded in mathematical insights and empirical evaluations. In particular, we introduce a noise regularization and data normalization scheme, alleviating problems with over-fitting, initialization and hyper-parameter sensitivity of such estimators. We compare our proposed methodology with popular semi- and non-parametric density estimators, demonstrate its effectiveness in various benchmarks on simulated and Euro Stoxx 50 data and show its superior performance. Our methodology makes it possible to obtain high-quality estimators for statistical expectations of higher moments, quantiles and non-linear return transformations, with very few assumptions about the return dynamic.
    Date: 2019–03
  10. By: Robert J. Shiller (Cowles Foundation, Yale University)
    Abstract: Concerns that technological progress degrades job opportunities have been expressed over much of the last two centuries by both professional economists and the general public. These concerns can be seen in narratives both in scholarly publications and in the news media. Part of the expressed concern about jobs has been about the potential for increased economic inequality. But another part of the concern has been about a perceived decline in job quality in terms of its effects on monotony vs creativity of work, individual sense of identity, power to act independently, and meaning of life. Public policy should take account of both of these concerns, inequality and job quality.
    Keywords: Labor-saving machines, Artificial intelligence, History of thought, Division of labor, Unemployment, Automation, Robotics
    JEL: N3 J0 B0 E2
    Date: 2019–02
  11. By: Jiun-Hua Su
    Abstract: The semiparametric maximum utility estimation proposed by Elliott and Lieli (2013) can be viewed as cost-sensitive binary classification; thus, its in-sample overfitting issue is similar to that of perceptron learning in the machine learning literature. Based on structural risk minimization, a utility-maximizing prediction rule (UMPR) is constructed to alleviate the in-sample overfitting of the maximum utility estimation. We establish non-asymptotic upper bounds on the difference between the maximal expected utility and the generalized expected utility of the UMPR. Simulation results show that the UMPR with an appropriate data-dependent penalty outperforms some common estimators in binary classification if the conditional probability of the binary outcome is misspecified, or a decision maker's preference is ignored.
    Date: 2019–03
  12. By: Ludovic Goudenège; Andrea Molent; Antonino Zanette
    Abstract: In this paper we develop an efficient approach based on a Machine Learning technique which allows one to quickly evaluate insurance products considering stochastic volatility and interest rate. Specifically, following De Spiegeleer et al., we apply Gaussian Process Regression to compute the price and the Greeks of a GMWB Variable Annuity. Starting from observed prices previously computed by means of a Hybrid Tree PDE approach for some known combinations of model parameters, it is possible to approximate the whole target function on a bounded domain. The regression algorithm consists of two main steps: algorithm training and evaluation. In particular, the first step is the most time demanding, but it needs to be performed only once, while the prediction step is very fast and needs to be performed only when evaluating the function. Besides the calculation of prices and Greeks, the developed method can also be employed to compute the no-arbitrage fee, which is a common practice in the Variable Annuities sector. We consider three models of increasing complexity, namely the Black-Scholes, the Heston and the Heston Hull-White models, which extend the sources of randomness to include stochastic volatility and stochastic interest rate together. Numerical experiments show that the accuracy of the estimated values is high, while the computational cost is much lower than the one required by a direct calculation with standard approaches. Finally, we stress that the analysis is carried out for a GMWB annuity but it could be generalized to other insurance products. Machine Learning seems to be a very promising and interesting tool for insurance risk management.
    Date: 2019–03
  13. By: Cattaneo, Matias D. (University of Michigan); Crump, Richard K. (Federal Reserve Bank of New York); Farrell, Max H. (University of Chicago); Feng, Yingjie (University of Michigan)
    Abstract: Binscatter is very popular in applied microeconomics. It provides a flexible, yet parsimonious way of visualizing and summarizing “big data” in regression settings, and it is often used for informal testing of substantive hypotheses such as linearity or monotonicity of the regression function. This paper presents a foundational, thorough analysis of binscatter: We give an array of theoretical and practical results that aid both in understanding current practices (that is, their validity or lack thereof) and in offering theory-based guidance for future applications. Our main results include principled number of bins selection, confidence intervals and bands, hypothesis tests for parametric and shape restrictions of the regression function, and several other new methods, applicable to canonical binscatter as well as higher-order polynomial, covariate-adjusted, and smoothness-restricted extensions thereof. In particular, we highlight important methodological problems related to covariate adjustment methods used in current practice. We also discuss extensions to clustered data. Our results are illustrated with simulated and real data throughout. Companion general-purpose software packages for Stata and R are provided. Finally, from a technical perspective, new theoretical results for partitioning-based series estimation are obtained that may be of independent interest.
    Keywords: binned scatter plot; regressogram; piecewise polynomials; splines; partitioning estimators; nonparametric regression; robust bias correction; uniform inference; binning selection
    JEL: C14 C18 C21
    Date: 2019–02–01
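
For readers unfamiliar with the binned scatter plot discussed in item 13, the following is a minimal sketch of a canonical binscatter, written under stated assumptions: bins are quantile-spaced (roughly equal numbers of observations per bin), and each bin is summarized by the mean of x and the mean of y. It is not the authors' companion software and implements none of the paper's bin-selection, confidence-band, or covariate-adjustment methods. The function name `binscatter` and the parameter `n_bins` are illustrative only.

```python
from statistics import mean


def binscatter(x, y, n_bins):
    """Return one (mean x, mean y) point per quantile-spaced bin of x.

    Illustrative sketch only: the number of bins is chosen by the caller,
    not by any data-driven rule.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    if not 1 <= n_bins <= len(x):
        raise ValueError("n_bins must be between 1 and the number of observations")

    # Sort observations by x so that consecutive slices are quantile bins.
    pairs = sorted(zip(x, y))
    n = len(pairs)

    points = []
    for j in range(n_bins):
        # Integer cut points give bins whose sizes differ by at most one.
        lo = j * n // n_bins
        hi = (j + 1) * n // n_bins
        chunk = pairs[lo:hi]
        points.append((mean(p[0] for p in chunk), mean(p[1] for p in chunk)))
    return points
```

Plotting the returned points against each other gives the familiar binned scatter. With `x = [1, ..., 10]`, `y = 2 * x` and two bins, the sketch returns the points (3, 6) and (8, 16), which lie on the underlying line.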

This nep-big issue is ©2019 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at For comments please write to the director of NEP, Marco Novarese at <>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.