nep-big New Economics Papers
on Big Data
Issue of 2020‒02‒03
sixteen papers chosen by
Tom Coupé
University of Canterbury

  1. Machine Labor By Joshua Angrist; Brigham Frandsen
  2. Market Efficiency in the Age of Big Data By Ian Martin; Stefan Nagel
  3. The Productivity and Unemployment Effects of the Digital Transformation: an Empirical and Modelling Assessment By Bertani, Filippo; Raberto, Marco; Teglio, Andrea
  4. Improving Finite Sample Approximation by Central Limit Theorems for Estimates from Data Envelopment Analysis By Ya Chen; Mike Tsionas; Valentin Zelenyuk
  5. An Artificial Intelligence approach to Shadow Rating By Angela Rita Provenzano; Daniele Trifirò; Nicola Jean; Giacomo Le Pera; Maurizio Spadaccino; Luca Massaron; Claudio Nordio
  6. Credit growth, the yield curve and financial crisis prediction: evidence from a machine learning approach By Bluwstein, Kristina; Buckmann, Marcus; Joseph, Andreas; Kang, Miao; Kapadia, Sujit; Simsek, Özgür
  7. Design of High-Frequency Trading Algorithm Based on Machine Learning By Boyue Fang; Yutong Feng
  8. Economic policy uncertainty in the euro area: an unsupervised machine learning approach By Azqueta-Gavaldon, Andres; Hirschbühl, Dominik; Onorante, Luca; Saiz, Lorena
  9. China's First Workforce Skill Taxonomy By Weipan Xu; Xiaozhen Qin; Xun Li; Haohui "Caron" Chen; Morgan Frank; Alex Rutherford; Andrew Reeson; Iyad Rahwan
  10. The Production of Information in an Online World By Julia Cage; Nicolas Hervé; Marie-Luce Viaud
  11. Text as Data: Real-time Measurement of Economic Welfare By Rickard Nyman; Paul Ormerod
  12. TARGETING HUMANITARIAN AID USING ADMINISTRATIVE DATA: MODEL DESIGN AND VALIDATION By Onur Altindag; Stephen D. O’Connell; Aytug Sasmaz; Zeynep Balcioglu; Paola Cadoni; Matilda Jerneck; Aimee Kunze Foong
  13. Improving Finite Sample Approximation by Central Limit Theorems for Estimates from Data Envelopment Analysis By Léopold Simar; Valentin Zelenyuk
  14. Grouping of Contracts in Insurance using Neural Networks By Mark Kiermayer; Christian Weiß
  15. The Logic of Strategic Assets: From Oil to Artificial Intelligence By Jeffrey Ding; Allan Dafoe
  16. High-dimensional A-learning for optimal dynamic treatment regimes By Shi, Chengchun; Fan, Ailin; Song, Rui; Lu, Wenbin

  1. By: Joshua Angrist; Brigham Frandsen
    Abstract: Machine learning (ML) is mostly a predictive enterprise, while the questions of interest to labor economists are mostly causal. In pursuit of causal effects, however, ML may be useful for automated selection of ordinary least squares (OLS) control variables. We illustrate the utility of ML for regression-based causal inference by using lasso to select control variables for estimates of effects of college characteristics on wages. ML also seems relevant for an instrumental variables (IV) first stage, since the bias of two-stage least squares can be said to be due to over-fitting. Our investigation shows, however, that while ML-based instrument selection can improve on conventional 2SLS estimates, split-sample IV and LIML estimators do better. In some scenarios, the performance of ML-augmented IV estimators is degraded by pretest bias. In others, nonlinear ML for covariate control creates artificial exclusion restrictions that generate spurious findings. ML does better at choosing control variables for models identified by conditional independence assumptions than at choosing instrumental variables for models identified by exclusion restrictions.
    JEL: C21 C26 C52 C55 J01 J08
    Date: 2019–12
    URL: http://d.repec.org/n?u=RePEc:nbr:nberwo:26584&r=all
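    A minimal sketch of the lasso-based control selection discussed above, assuming a simple post-lasso workflow (lasso picks the controls, then OLS re-estimates the treatment effect). Variable names and data are hypothetical placeholders, and the IV applications studied in the paper are not sketched here.
      # Post-lasso control selection: lasso chooses controls, then OLS with them.
      import numpy as np
      import pandas as pd
      from sklearn.linear_model import LassoCV
      import statsmodels.api as sm

      rng = np.random.default_rng(0)
      n, p = 2000, 50
      X = pd.DataFrame(rng.normal(size=(n, p)),
                       columns=[f"x{j}" for j in range(p)])    # candidate controls
      treat = (X["x0"] + rng.normal(size=n) > 0).astype(float)  # treatment indicator
      wage = 0.3 * treat + X["x0"] + 0.5 * X["x1"] + rng.normal(size=n)

      # Step 1: lasso selects which controls matter for the outcome.
      lasso = LassoCV(cv=5).fit(X, wage)
      selected = X.columns[lasso.coef_ != 0]

      # Step 2: OLS of the outcome on the treatment plus the selected controls.
      design = sm.add_constant(pd.concat([treat.rename("treat"), X[selected]], axis=1))
      ols = sm.OLS(wage, design).fit(cov_type="HC1")
      print(ols.params["treat"], ols.bse["treat"])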
  2. By: Ian Martin; Stefan Nagel
    Abstract: Modern investors face a high-dimensional prediction problem: thousands of observable variables are potentially relevant for forecasting. We reassess the conventional wisdom on market efficiency in light of this fact. In our model economy, which resembles a typical machine learning setting, N assets have cash flows that are a linear function of J firm characteristics, but with uncertain coefficients. Risk-neutral Bayesian investors impose shrinkage (ridge regression) or sparsity (Lasso) when they estimate the J coefficients of the model and use them to price assets. When J is comparable in size to N, returns appear cross-sectionally predictable using firm characteristics to an econometrician who analyzes data from the economy ex post. A factor zoo emerges even without p-hacking and data-mining. Standard in-sample tests of market efficiency reject the no-predictability null with high probability, despite the fact that investors optimally use the information available to them in real time. In contrast, out-of-sample tests retain their economic meaning.
    JEL: C11 G12 G14
    Date: 2019–12
    URL: http://d.repec.org/n?u=RePEc:nbr:nberwo:26586&r=all
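    A stylized simulation of the mechanism described above, under purely illustrative assumptions (the dimensions, noise levels and ridge penalty are arbitrary and the paper's model is richer): risk-neutral investors price assets with ridge-shrunk coefficient estimates, and an econometrician regressing realized returns on characteristics ex post finds apparent in-sample predictability.
      import numpy as np
      from sklearn.linear_model import Ridge, LinearRegression

      rng = np.random.default_rng(1)
      N, J = 200, 150                      # J comparable in size to N
      X = rng.normal(size=(N, J))          # firm characteristics
      theta = rng.normal(scale=0.1, size=J)

      # Investors estimate theta from a noisy signal with shrinkage and price assets.
      signal = X @ theta + rng.normal(scale=1.0, size=N)
      theta_hat = Ridge(alpha=10.0).fit(X, signal).coef_
      prices = X @ theta_hat               # risk-neutral prices

      # Realized cash flows and returns.
      cashflows = X @ theta + rng.normal(scale=1.0, size=N)
      returns = cashflows - prices

      # Ex post, returns look cross-sectionally predictable from characteristics.
      r2 = LinearRegression().fit(X, returns).score(X, returns)
      print("in-sample R^2 of returns on characteristics:", round(r2, 2))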
  3. By: Bertani, Filippo; Raberto, Marco; Teglio, Andrea
    Abstract: Over the last 30 years, the economy has been undergoing a massive digital transformation. Intangible digital assets, like software solutions, web services and, more recently, deep learning algorithms, artificial intelligence and digital platforms, have been increasingly adopted thanks to the diffusion and advancement of information and communication technologies. Various observers argue that we could rapidly approach a technological singularity leading to explosive economic growth. The contribution of this paper is on both the empirical and the modelling side. First, we present a cross-country empirical analysis assessing the correlation between intangible digital assets and different measures of productivity. Then we assess their long-term impact on unemployment under different scenarios by means of an agent-based macro-model.
    Keywords: Intangible assets, Digital transformation, Total factor productivity, Technological unemployment, Agent-based economics
    JEL: C63
    Date: 2020–01–20
    URL: http://d.repec.org/n?u=RePEc:pra:mprapa:98233&r=all
  4. By: Ya Chen (Hefei University of Technology, China); Mike Tsionas (Lancaster University, United Kingdom); Valentin Zelenyuk (School of Economics and Centre for Efficiency and Productivity Analysis (CEPA) at The University of Queensland, Australia)
    Abstract: We propose an improvement of the finite sample approximation of the central limit theorems (CLTs) that were recently derived for statistics involving production efficiency scores estimated via Data Envelopment Analysis (DEA) or Free Disposal Hull (FDH) approaches. The improvement is very easy to implement: it involves a simple correction of the variance estimator with an estimate of the bias of the statistics already employed, adds no computational burden, and preserves the original asymptotic results such as consistency and asymptotic normality. The proposed approach consistently showed improvement in all the scenarios we tried in various Monte Carlo experiments, especially for relatively small samples or relatively large dimensions (measured by the total number of inputs and outputs) of the underlying production model. This approach is therefore expected to produce more accurate estimates of confidence intervals for aggregates of individual efficiency scores in empirical research using DEA or FDH approaches, and so should be valuable for practitioners. We also illustrate the method using a popular real data set to confirm that the difference in the estimated confidence intervals can be substantial. A step-by-step implementation algorithm of the proposed approach is included in the Appendix.
    Keywords: DEA; sign-constrained convex nonparametric least squares (SCNLS); LASSO; elastic net; big data
    Date: 2020–01
    URL: http://d.repec.org/n?u=RePEc:qld:uqcepa:145&r=all
  5. By: Angela Rita Provenzano; Daniele Trifirò; Nicola Jean; Giacomo Le Pera; Maurizio Spadaccino; Luca Massaron; Claudio Nordio
    Abstract: We analyse the effectiveness of modern deep learning techniques in predicting credit ratings over a universe of thousands of global corporate entities' obligations, compared to the most popular traditional machine-learning approaches such as linear models and tree-based classifiers. Our results show adequate accuracy across different rating classes when categorical embeddings are applied to artificial neural network (ANN) architectures.
    Date: 2019–12
    URL: http://d.repec.org/n?u=RePEc:arx:papers:1912.09764&r=all
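    A minimal sketch of a rating classifier with categorical embeddings, broadly in the spirit of the abstract above. It uses tf.keras; the input features, layer sizes and embedding dimension are assumptions rather than the authors' architecture.
      import numpy as np
      import tensorflow as tf

      n_obligors, n_sectors, n_ratings = 5000, 20, 7
      sector = np.random.randint(0, n_sectors, size=(n_obligors, 1))  # categorical input
      financials = np.random.normal(size=(n_obligors, 10))            # numeric ratios
      ratings = np.random.randint(0, n_ratings, size=n_obligors)      # target classes

      cat_in = tf.keras.Input(shape=(1,), dtype="int32")
      num_in = tf.keras.Input(shape=(10,))
      emb = tf.keras.layers.Embedding(n_sectors, 4)(cat_in)           # learned embedding
      emb = tf.keras.layers.Flatten()(emb)
      x = tf.keras.layers.Concatenate()([emb, num_in])
      x = tf.keras.layers.Dense(64, activation="relu")(x)
      out = tf.keras.layers.Dense(n_ratings, activation="softmax")(x)

      model = tf.keras.Model([cat_in, num_in], out)
      model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                    metrics=["accuracy"])
      model.fit([sector, financials], ratings, epochs=3, batch_size=128, verbose=0)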
  6. By: Bluwstein, Kristina (Bank of England); Buckmann, Marcus (Bank of England); Joseph, Andreas (Bank of England and King’s College London); Kang, Miao (Bank of England); Kapadia, Sujit (European Central Bank); Simsek, Özgür (University of Bath)
    Abstract: We develop early warning models for financial crisis prediction using machine learning techniques on macrofinancial data for 17 countries over 1870–2016. Machine learning models mostly outperform logistic regression in out-of-sample predictions and forecasting. We identify economic drivers of our machine learning models using a novel framework based on Shapley values, uncovering non-linear relationships between the predictors and crisis risk. Throughout, the most important predictors are credit growth and the slope of the yield curve, both domestically and globally. A flat or inverted yield curve is of most concern when nominal interest rates are low and credit growth is high.
    Keywords: Machine learning; financial crisis; financial stability; credit growth; yield curve; Shapley values; out-of-sample prediction
    JEL: C40 C53 E44 F30 G01
    Date: 2020–01–03
    URL: http://d.repec.org/n?u=RePEc:boe:boeewp:0848&r=all
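    A minimal sketch of the workflow described above: fit a tree-based early-warning classifier on macrofinancial predictors and decompose its predictions with Shapley values via the shap package. Predictor names and the simulated data are placeholders, and shap's output conventions can differ slightly across versions.
      import numpy as np
      import pandas as pd
      import shap
      from sklearn.ensemble import GradientBoostingClassifier

      rng = np.random.default_rng(3)
      n = 1000
      X = pd.DataFrame({
          "credit_growth": rng.normal(size=n),
          "yield_curve_slope": rng.normal(size=n),
          "global_credit_growth": rng.normal(size=n),
      })
      crisis = (0.8 * X["credit_growth"] - 0.6 * X["yield_curve_slope"]
                + rng.normal(size=n) > 1.0).astype(int)

      model = GradientBoostingClassifier(random_state=0).fit(X, crisis)
      sv = shap.TreeExplainer(model).shap_values(X)   # one 2D array for a binary GBM
      # Mean absolute Shapley value per predictor = a simple importance ranking.
      print(pd.Series(np.abs(sv).mean(axis=0), index=X.columns))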
  7. By: Boyue Fang; Yutong Feng
    Abstract: Based on iterative optimization and activation functions in deep learning, we propose a new analytical framework for high-frequency trading information that reduces the structural loss in assembling the Volume-Synchronized Probability of Informed Trading (VPIN), Generalized Autoregressive Conditional Heteroscedasticity (GARCH) and Support Vector Machines (SVM), so as to make full use of the order book information. In the return acquisition procedure of market-making transactions, uncovering the relationship between discrete-dimensional data projected from high-dimensional time series significantly improves the model's performance. VPIN pre-judges market liquidity, and its effectiveness is backtested with CSI 300 futures returns.
    Date: 2019–12
    URL: http://d.repec.org/n?u=RePEc:arx:papers:1912.10343&r=all
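    A rough sketch of the kind of feature assembly the abstract points to, under stated assumptions: a GARCH(1,1) conditional volatility (via the arch package) and a simple order-imbalance proxy standing in for VPIN feed an SVM that classifies the sign of the next-period return. This is an illustration, not the authors' algorithm.
      import numpy as np
      import pandas as pd
      from arch import arch_model
      from sklearn.svm import SVC

      rng = np.random.default_rng(4)
      n = 1500
      returns = pd.Series(rng.normal(scale=0.01, size=n))   # placeholder returns
      buy_vol = rng.gamma(2.0, 100.0, size=n)
      sell_vol = rng.gamma(2.0, 100.0, size=n)

      garch = arch_model(returns * 100, vol="Garch", p=1, q=1).fit(disp="off")  # scaled for stability
      vol = garch.conditional_volatility
      imbalance = np.abs(buy_vol - sell_vol) / (buy_vol + sell_vol)             # VPIN-like proxy

      X = np.column_stack([vol.values[:-1], imbalance[:-1]])
      y = (returns.shift(-1).dropna() > 0).astype(int).values                   # next-period sign
      clf = SVC(kernel="rbf").fit(X, y)
      print("in-sample accuracy:", clf.score(X, y))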
  8. By: Azqueta-Gavaldon, Andres; Hirschbühl, Dominik; Onorante, Luca; Saiz, Lorena
    Abstract: We model economic policy uncertainty (EPU) in the four largest euro area countries by applying machine learning techniques to news articles. The unsupervised machine learning algorithm used makes it possible to retrieve the individual components of overall EPU endogenously for a wide range of languages. The uncertainty indices computed from January 2000 to May 2019 capture episodes of regulatory change, trade tensions and financial stress. In an evaluation exercise, we use a structural vector autoregression model to study the relationship between different sources of uncertainty and investment in machinery and equipment as a proxy for business investment. We document strong heterogeneity and asymmetries in the relationship between investment and uncertainty across and within countries. For example, while investment in France, Italy and Spain reacts strongly to political uncertainty shocks, in Germany investment is more sensitive to trade uncertainty shocks.
    Keywords: economic policy uncertainty, Europe, machine learning, textual-data
    JEL: C80 D80 E22 E66 G18 G31
    Date: 2020–01
    URL: http://d.repec.org/n?u=RePEc:ecb:ecbwps:20202359&r=all
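    The abstract does not name the unsupervised algorithm; a common choice for news-based uncertainty indices is LDA topic modelling, sketched here with scikit-learn on a placeholder corpus.
      from sklearn.feature_extraction.text import CountVectorizer
      from sklearn.decomposition import LatentDirichletAllocation

      articles = [
          "government announces new fiscal policy uncertainty over budget",
          "trade tensions raise tariffs and uncertainty for exporters",
          "bank stress and financial regulation dominate headlines",
      ]  # placeholder corpus; the paper works with large national news archives

      counts = CountVectorizer(stop_words="english").fit_transform(articles)
      lda = LatentDirichletAllocation(n_components=3, random_state=0).fit(counts)
      doc_topics = lda.transform(counts)   # article-by-topic weights
      # An EPU-style index then counts or weights articles loading on
      # "uncertainty" topics over time, normalised by total article volume.
      print(doc_topics.round(2))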
  9. By: Weipan Xu; Xiaozhen Qin; Xun Li; Haohui "Caron" Chen; Morgan Frank; Alex Rutherford; Andrew Reeson; Iyad Rahwan
    Abstract: China is the world's second largest economy. After four decades of economic miracles, China's economy is transitioning into an advanced, knowledge-based economy. Yet, we still lack a detailed understanding of the skills that underlie the Chinese labor force, and of the development and spatial distribution of these skills. For example, the US standardized skill taxonomy O*NET played an important role in understanding the dynamics of manufacturing and knowledge-based work, as well as potential risks from automation and outsourcing. Here, we use machine learning techniques to bridge this gap, creating China's first workforce skill taxonomy and mapping it to O*NET. This enables us to reveal workforce skill polarization into social-cognitive skills and sensory-physical skills, to explore China's regional inequality in light of workforce skills, and to compare it to traditional metrics such as education. We build an online tool for the public and policy makers to explore the skill taxonomy: skills.sysu.edu.cn. We will also make the taxonomy dataset publicly available for other researchers upon publication.
    Date: 2020–01
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2001.02863&r=all
  10. By: Julia Cage (Département d'économie); Nicolas Hervé (Institut national de l'audiovisuel); Marie-Luce Viaud (Institut national de l'audiovisuel)
    Abstract: News production requires investment, and competitors' ability to appropriate a story may reduce a media outlet's incentives to provide original content. Yet, there is little legal protection of intellectual property rights in online news production, which raises the issue of the extent of copying online and of the incentives to provide original content. In this article, we build a unique dataset combining all the online content produced by French news media during the year 2013 with new micro audience data. We develop a topic detection algorithm that identifies each news event, trace the timeline of each story, and study news propagation. We provide new evidence on online news production. First, we document the high reactivity of online media: a quarter of news stories are reproduced online in under 4 minutes. We show that this is accompanied by substantial copying, both at the extensive and at the intensive margins, which may constitute a severe threat to the commercial viability of the news media. Next, we estimate the returns to originality in online news production. Using article-level variation and media-level daily audience combined with article-level social media statistics, we find that original content producers tend to receive more viewers, thereby mitigating the newsgathering incentive problem raised by copying.
    Keywords: Internet; Information Sharing; Copyright; Social Media; Reputation
    JEL: L11 L15 L82 L86
    Date: 2019–12
    URL: http://d.repec.org/n?u=RePEc:spo:wpmain:info:hdl:2441/52cps7rdns8iv8fr3f1kqm7iuv&r=all
  11. By: Rickard Nyman; Paul Ormerod
    Abstract: Economists are showing increasing interest in the use of text as an input to economic research. Here, we analyse online text to construct a real time metric of welfare. For purposes of description, we call it the Feel Good Factor (FGF). The particular example used to illustrate the concept is confined to data from the London area, but the methodology is readily generalisable to other geographical areas. The FGF illustrates the use of online data to create a measure of welfare which is not based, as GDP is, on value added in a market-oriented economy. There is already a large literature which measures wellbeing/happiness. But this relies on conventional survey approaches, and hence on the stated preferences of respondents. In unstructured online media text, users reveal their emotions in ways analogous to the principle of revealed preference in consumer demand theory. The analysis of online media offers further advantages over conventional survey-based measures of sentiment or well-being. It can be carried out in real time rather than with the lags which are involved in survey approaches. In addition, it is very much cheaper.
    Date: 2020–01
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2001.03401&r=all
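    A minimal sketch of a real-time, text-based welfare gauge in the spirit of the FGF, using the off-the-shelf VADER sentiment lexicon as an illustrative stand-in for the authors' method; the posts and dates are placeholders.
      import pandas as pd
      from nltk.sentiment import SentimentIntensityAnalyzer
      # import nltk; nltk.download("vader_lexicon")   # one-off lexicon download

      posts = pd.DataFrame({
          "date": ["2020-01-01", "2020-01-01", "2020-01-02"],
          "text": ["feeling great about the new year",
                   "worried about rent and bills",
                   "lovely day out in the park"],
      })  # placeholder posts; the paper uses London-area online media text

      sia = SentimentIntensityAnalyzer()
      posts["score"] = posts["text"].apply(lambda t: sia.polarity_scores(t)["compound"])
      fgf = posts.groupby("date")["score"].mean()     # daily "feel good" index
      print(fgf)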
  12. By: Onur Altindag (Bentley University Department of Economics and Economic Research Forum); Stephen D. O’Connell (Emory University Department of Economics and IZA Institute of Labor Economics); Aytug Sasmaz (Harvard University Department of Government); Zeynep Balcioglu (Northeastern University Department of Political Science); Paola Cadoni (UNHCR Lebanon); Matilda Jerneck (UNHCR Lebanon); Aimee Kunze Foong (UNHCR Lebanon)
    Abstract: We develop and assess the performance of an econometric targeting model for a large-scale humanitarian aid program providing unconditional cash and food assistance to refugees in Lebanon. We use regularized linear regression to derive a prediction model for household expenditure based on demographic and background characteristics from administrative data that are routinely collected by humanitarian agencies. Standard metrics of prediction accuracy suggest this approach compares favorably to the commonly used “scorecard” Proxy Means Test, which requires a survey of the entire target population. We confirm these results through a blind validation test performed on a random sample collected after the model derivation.
    Date: 2019–09–20
    URL: http://d.repec.org/n?u=RePEc:erg:wpaper:1343&r=all
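    A minimal sketch of the targeting approach described above: a cross-validated lasso predicts household expenditure from administrative characteristics, and the lowest-predicted households are flagged for assistance. Variable names, the simulated data and the 30% coverage rate are illustrative assumptions.
      import numpy as np
      from sklearn.linear_model import LassoCV
      from sklearn.model_selection import train_test_split

      rng = np.random.default_rng(5)
      n, p = 3000, 40
      X = rng.normal(size=(n, p))                 # demographic / background variables
      expenditure = X[:, :5].sum(axis=1) + rng.normal(size=n)

      X_train, X_test, y_train, y_test = train_test_split(X, expenditure, random_state=0)
      model = LassoCV(cv=5).fit(X_train, y_train)

      pred = model.predict(X_test)
      cutoff = np.quantile(pred, 0.30)            # e.g. target the poorest 30%
      targeted = pred <= cutoff
      true_poor = y_test <= np.quantile(y_test, 0.30)
      print("share of truly poor households reached:",
            round((targeted & true_poor).sum() / true_poor.sum(), 2))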
  13. By: Léopold Simar (Institut de Statistique, Biostatistique et Sciences Actuarielles, Université Catholique de Louvain, B1348 Louvain-la-Neuve, Belgium); Valentin Zelenyuk (School of Economics and Centre for Efficiency and Productivity Analysis (CEPA) at The University of Queensland, Australia)
    Abstract: We propose an improvement of the finite sample approximation of the central limit theorems (CLTs) that were recently derived for statistics involving production efficiency scores estimated via Data Envelopment Analysis (DEA) or Free Disposal Hull (FDH) approaches. The improvement is very easy to implement: it involves a simple correction of the variance estimator with an estimate of the bias of the statistics already employed, adds no computational burden, and preserves the original asymptotic results such as consistency and asymptotic normality. The proposed approach consistently showed improvement in all the scenarios we tried in various Monte Carlo experiments, especially for relatively small samples or relatively large dimensions (measured by the total number of inputs and outputs) of the underlying production model. This approach is therefore expected to produce more accurate estimates of confidence intervals for aggregates of individual efficiency scores in empirical research using DEA or FDH approaches, and so should be valuable for practitioners. We also illustrate the method using a popular real data set to confirm that the difference in the estimated confidence intervals can be substantial. A step-by-step implementation algorithm of the proposed approach is included in the Appendix.
    Keywords: Data Envelopment Analysis, DEA; Free Disposal Hull, FDH; Statistical Inference; Production Efficiency; Productivity.
    Date: 2020–01
    URL: http://d.repec.org/n?u=RePEc:qld:uqcepa:144&r=all
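    A hedged sketch of the kind of correction the abstract describes, for the mean efficiency score: the confidence interval uses a bias-corrected mean and a variance estimator adjusted with the bias estimate. The specific adjustment shown (adding the squared bias estimate) is an assumption inferred from the abstract, not the paper's stated formula, and the DEA scores and bias estimate are placeholders assumed to come from a standard routine.
      import numpy as np
      from scipy import stats

      theta_hat = np.random.default_rng(6).uniform(0.5, 1.0, size=50)  # placeholder DEA scores
      bias_hat = 0.03   # placeholder bias estimate (e.g. from a generalized jackknife)

      n = theta_hat.size
      mean_hat = theta_hat.mean()
      var_hat = theta_hat.var(ddof=1)
      z = stats.norm.ppf(0.975)

      # Standard interval: bias-correct the mean, use the plain variance estimate.
      half = z * np.sqrt(var_hat / n)
      ci_standard = (mean_hat - bias_hat - half, mean_hat - bias_hat + half)

      # Adjusted interval: variance estimator corrected with the bias estimate
      # (here by adding its square -- an assumed form, see the note above).
      half_adj = z * np.sqrt((var_hat + bias_hat ** 2) / n)
      ci_adjusted = (mean_hat - bias_hat - half_adj, mean_hat - bias_hat + half_adj)
      print(ci_standard, ci_adjusted)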
  14. By: Mark Kiermayer; Christian Weiß
    Abstract: Despite the high practical importance of grouping, there is little research on the topic. The present work provides a complete framework for grouping and a novel method to optimize model points. Model points are used to substitute clusters of contracts in an insurance portfolio and thus yield a smaller, computationally less burdensome portfolio. This grouped portfolio is controlled to have characteristics similar to those of the original portfolio. We provide numerical results for term life insurance and defined contribution plans, which indicate the superiority of our approach compared to K-means clustering, a common baseline algorithm for grouping. Lastly, we show that the presented concept can optimize a fixed number of model points for the entire portfolio simultaneously. This eliminates the need for any pre-clustering of the portfolio, e.g. by K-means clustering, and therefore establishes our method as an entirely new and independent methodology.
    Date: 2019–12
    URL: http://d.repec.org/n?u=RePEc:arx:papers:1912.09964&r=all
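    A minimal sketch of the K-means baseline the abstract compares against (not the proposed neural-network method): contracts are clustered on standardized features and each cluster centroid, weighted by cluster size, serves as a model point. Feature names and data are placeholders.
      import numpy as np
      import pandas as pd
      from sklearn.cluster import KMeans
      from sklearn.preprocessing import StandardScaler

      rng = np.random.default_rng(7)
      contracts = pd.DataFrame({
          "age": rng.integers(20, 65, size=10000),
          "sum_insured": rng.lognormal(11, 0.5, size=10000),
          "remaining_term": rng.integers(1, 30, size=10000),
      })

      X = StandardScaler().fit_transform(contracts)
      km = KMeans(n_clusters=50, n_init=10, random_state=0).fit(X)

      # Model points: cluster centroids on the original scale, with cluster weights.
      centroids = contracts.groupby(km.labels_).mean()
      weights = pd.Series(km.labels_).value_counts().sort_index()
      model_points = centroids.assign(weight=weights.values)
      print(model_points.head())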
  15. By: Jeffrey Ding; Allan Dafoe
    Abstract: What resources and technologies are strategic? This question is often the focus of policy and theoretical debates, where the label "strategic" designates those assets that warrant the attention of the highest levels of the state. But these conversations are plagued by analytical confusion, flawed heuristics, and the rhetorical use of "strategic" to advance particular agendas. We aim to improve these conversations through conceptual clarification, introducing a theory based on important rivalrous externalities for which socially optimal behavior will not be produced by markets or individual national security entities alone. We distill and theorize the three most important forms of these externalities, which involve cumulative-, infrastructure-, and dependency-strategic logics. We then employ these logics to clarify three important cases: the Avon 2 engine in the 1950s, the U.S.-Japan technology rivalry in the late 1980s, and contemporary conversations about artificial intelligence.
    Date: 2020–01
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2001.03246&r=all
  16. By: Shi, Chengchun; Fan, Ailin; Song, Rui; Lu, Wenbin
    Abstract: Precision medicine is a medical paradigm that focuses on finding the most effective treatment decision based on individual patient information. For many complex diseases, such as cancer, treatment decisions need to be tailored over time according to patients' responses to previous treatments. Such an adaptive strategy is referred to as a dynamic treatment regime. A major challenge in deriving an optimal dynamic treatment regime arises when an extraordinarily large number of prognostic factors, such as patients' genetic information, demographic characteristics, medical history and clinical measurements over time, are available, but not all of them are necessary for making treatment decisions. This makes variable selection an emerging need in precision medicine. In this paper, we propose a penalized multi-stage A-learning method for deriving the optimal dynamic treatment regime when the number of covariates is of non-polynomial (NP) order of the sample size. To preserve the double robustness property of the A-learning method, we adopt the Dantzig selector, which directly penalizes the A-learning estimating equations. Oracle inequalities of the proposed estimators for the parameters in the optimal dynamic treatment regime and error bounds on the difference between the value functions of the estimated optimal dynamic treatment regime and the true optimal dynamic treatment regime are established. Empirical performance of the proposed approach is evaluated by simulations and illustrated with an application to data from the STAR*D study.
    Keywords: A-learning; Dantzig selector; model misspecification; NP-dimensionality; optimal dynamic treatment regime; oracle inequality
    JEL: C1
    Date: 2018–06–01
    URL: http://d.repec.org/n?u=RePEc:ehl:lserod:102113&r=all
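    A sketch of the Dantzig selector itself, which the abstract applies to the A-learning estimating equations: minimize the l1 norm of the coefficients subject to a sup-norm bound on the correlation between residuals and the design, written as a linear program. The full multi-stage A-learning machinery of the paper is not reproduced, and the data and tuning constant are illustrative.
      import numpy as np
      from scipy.optimize import linprog

      def dantzig_selector(X, y, lam):
          """Solve min ||beta||_1  s.t.  ||X'(y - X beta)||_inf <= lam,
          by writing beta = u - v with u, v >= 0."""
          n, p = X.shape
          G = X.T @ X
          c = np.ones(2 * p)                        # objective: sum(u) + sum(v)
          # Constraints:  -lam <= X'y - G(u - v) <= lam
          A_ub = np.vstack([np.hstack([ G, -G]),    #  Gu - Gv <= lam + X'y
                            np.hstack([-G,  G])])   # -Gu + Gv <= lam - X'y
          b_ub = np.concatenate([lam + X.T @ y, lam - X.T @ y])
          res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * (2 * p),
                        method="highs")
          return res.x[:p] - res.x[p:]

      rng = np.random.default_rng(8)
      n, p = 100, 200                               # p larger than n
      X = rng.normal(size=(n, p))
      beta_true = np.zeros(p); beta_true[:3] = [2.0, -1.5, 1.0]
      y = X @ beta_true + rng.normal(scale=0.5, size=n)
      beta_hat = dantzig_selector(X, y, lam=0.5 * np.sqrt(2 * n * np.log(p)))
      print(np.flatnonzero(np.abs(beta_hat) > 0.1))  # estimated support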

This nep-big issue is ©2020 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at http://nep.repec.org. For comments please write to the director of NEP, Marco Novarese at <director@nep.repec.org>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.