nep-big New Economics Papers
on Big Data
Issue of 2018‒01‒22
thirteen papers chosen by
Tom Coupé
University of Canterbury

  1. Double/debiased machine learning for treatment and structural parameters By Victor Chernozhukov; Denis Chetverikov; Mert Demirer; Esther Duflo; Christian Hansen; Whitney K. Newey; James Robins
  2. Hedonic Recommendations: An Econometric Application on Big Data By Okay Gunes
  3. PrivySense: Price Volatility based Sentiments Estimation from Financial News using Machine Learning By Raeid Saqur; Nicole Langballe
  4. Hospital Readmission is Highly Predictable from Deep Learning By Damien Échevin; Qing Li; Marc-André Morin
  5. How Well Do Automated Methods Perform in Historical Samples? Evidence from New Ground Truth By Martha Bailey; Connor Cole; Morgan Henderson; Catherine Massey
  6. Macroeconomic Indicator Forecasting with Deep Neural Networks By Cook, Thomas R.; Smalter Hall, Aaron
  7. News and narratives in financial systems: exploiting big data for systemic risk assessment By Nyman, Rickard; Kapadia, Sujit; Tuckett, David; Gregory, David; Ormerod, Paul; Smith, Robert
  8. Monitoring Banking System Fragility with Big Data By Hale, Galina; Lopez, Jose A.
  9. Predict Forex Trend via Convolutional Neural Networks By Yun-Cheng Tsai; Jun-Hao Chen; Jun-Jie Wang
  10. Estimation and Inference of Treatment Effects with $L_2$-Boosting in High-Dimensional Settings By Ye Luo; Martin Spindler
  11. Consistent Pseudo-Maximum Likelihood Estimators By Christian Gouriéroux; Alain Monfort; Eric Renault
  12. Some Large Sample Results for the Method of Regularized Estimators By Michael Jansson; Demian Pouzo
  13. L'analyse lexicale au service de la cliodynamique : traitement par intelligence artificielle de la base Google Ngram By Jérôme Baray; Albert Da Silva; Jean-Marc Leblanc

  1. By: Victor Chernozhukov (Institute for Fiscal Studies and MIT); Denis Chetverikov (Institute for Fiscal Studies and UCLA); Mert Demirer (Institute for Fiscal Studies); Esther Duflo (Institute for Fiscal Studies); Christian Hansen (Institute for Fiscal Studies and Chicago GSB); Whitney K. Newey (Institute for Fiscal Studies and MIT); James Robins (Institute for Fiscal Studies)
    Abstract: We revisit the classic semiparametric problem of inference on a low-dimensional parameter θ0 in the presence of high-dimensional nuisance parameters η0. We depart from the classical setting by allowing for η0 to be so high-dimensional that the traditional assumptions, such as Donsker properties, that limit the complexity of the parameter space for this object break down. To estimate η0, we consider the use of statistical or machine learning (ML) methods, which are particularly well-suited to estimation in modern, very high-dimensional cases. ML methods perform well by employing regularization to reduce variance and trading off regularization bias with overfitting in practice. However, both regularization bias and overfitting in estimating η0 cause a heavy bias in estimators of θ0 that are obtained by naively plugging ML estimators of η0 into estimating equations for θ0. This bias results in the naive estimator failing to be $N^{-1/2}$ consistent, where N is the sample size. We show that the impact of regularization bias and overfitting on estimation of the parameter of interest θ0 can be removed by using two simple, yet critical, ingredients: (1) using Neyman-orthogonal moments/scores that have reduced sensitivity with respect to nuisance parameters to estimate θ0, and (2) making use of cross-fitting, which provides an efficient form of data-splitting. We call the resulting set of methods double or debiased ML (DML). We verify that DML delivers point estimators that concentrate in an $N^{-1/2}$-neighborhood of the true parameter values and are approximately unbiased and normally distributed, which allows construction of valid confidence statements. The generic statistical theory of DML is elementary and relies on only weak theoretical requirements, which admit the use of a broad array of modern ML methods for estimating the nuisance parameters, such as random forests, lasso, ridge, deep neural nets, boosted trees, and various hybrids and ensembles of these methods. We illustrate the general theory by deriving the properties of DML applied to learn the main regression parameter in a partially linear regression model, the coefficient on an endogenous variable in a partially linear instrumental variables model, the average treatment effect and the average treatment effect on the treated under unconfoundedness, and the local average treatment effect in an instrumental variables setting. In addition to these theoretical applications, we also illustrate the use of DML in three empirical examples.
    Date: 2017–06–02
    URL: http://d.repec.org/n?u=RePEc:ifs:cemmap:28/17&r=big
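    Editor's illustration (not the authors' code): a minimal cross-fitted sketch of DML for the partially linear model y = d*theta + g(X) + e, with random forests as the nuisance learner; the two-fold split, learner choice, and array inputs are assumptions.
      import numpy as np
      from sklearn.ensemble import RandomForestRegressor
      from sklearn.model_selection import KFold

      def dml_plm(y, d, X, n_folds=2, seed=0):
          """Cross-fitted, Neyman-orthogonal estimate of theta in y = d*theta + g(X) + e."""
          y_res, d_res = np.zeros(len(y)), np.zeros(len(d))
          for train, test in KFold(n_folds, shuffle=True, random_state=seed).split(X):
              # Nuisance functions are fit on one fold; residuals are formed on the held-out fold.
              m_y = RandomForestRegressor(random_state=seed).fit(X[train], y[train])
              m_d = RandomForestRegressor(random_state=seed).fit(X[train], d[train])
              y_res[test] = y[test] - m_y.predict(X[test])
              d_res[test] = d[test] - m_d.predict(X[test])
          theta = np.sum(d_res * y_res) / np.sum(d_res ** 2)   # orthogonal (partialled-out) score
          se = np.sqrt(np.sum(d_res ** 2 * (y_res - theta * d_res) ** 2)) / np.sum(d_res ** 2)
          return theta, se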
  2. By: Okay Gunes (CES - Centre d'économie de la Sorbonne - UP1 - Université Panthéon-Sorbonne - CNRS - Centre National de la Recherche Scientifique)
    Abstract: This work demonstrates how economic theory can be applied to big data analysis. To do this, I propose two layers of machine learning that use econometric models introduced into a recommender system. The reason for doing so is to challenge traditional recommendation approaches, which are inherently biased because they ignore the final preference order for each individual and under-specify the interaction between the socio-economic characteristics of the participants and the characteristics of the commodities in question. In this respect, our hedonic recommendation approach first corrects the internal preferences with respect to the tastes of each individual, given the characteristics of the products. In the second layer, the relative preferences across participants are predicted from socio-economic characteristics. The robustness of the model is tested with the MovieLens 100k dataset (943 users rating 1,682 movies) maintained by GroupLens. Our methodology shows the importance and necessity of correcting the data set by using economic theory, and it can be applied to all recommender systems that use ratings based on consumer decisions.
    Keywords: Big Data, Machine Learning, Recommendation Engine, Econometrics, Python, R
    Date: 2017–12
    URL: http://d.repec.org/n?u=RePEc:hal:cesptp:halshs-01673355&r=big
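    Editor's illustration (a guess at the two-layer structure, with entirely hypothetical file and column names; not the author's specification): layer one removes each user's hedonic component by regressing that user's ratings on item characteristics, and layer two predicts the adjusted ratings across users from socio-economic traits.
      import pandas as pd
      from sklearn.linear_model import LinearRegression

      ratings = pd.read_csv("ratings.csv")   # hypothetical: user_id, item_id, rating
      items = pd.read_csv("items.csv")       # hypothetical: item_id plus genre dummy columns
      users = pd.read_csv("users.csv")       # hypothetical: user_id plus socio-economic dummies
      df = ratings.merge(items, on="item_id")
      item_cols = [c for c in items.columns if c != "item_id"]

      # Layer 1: hedonic correction -- keep the part of each user's ratings not explained
      # by the characteristics of the rated items.
      df["adj_rating"] = 0.0
      for uid, g in df.groupby("user_id"):
          fit = LinearRegression().fit(g[item_cols], g["rating"])
          df.loc[g.index, "adj_rating"] = g["rating"] - fit.predict(g[item_cols])

      # Layer 2: predict the adjusted (relative) preferences across users from socio-economics.
      df2 = df.merge(users, on="user_id")
      user_cols = [c for c in users.columns if c != "user_id"]
      layer2 = LinearRegression().fit(df2[user_cols + item_cols], df2["adj_rating"])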
  3. By: Raeid Saqur; Nicole Langballe
    Abstract: As machine learning ascends the peak of the computer science zeitgeist, the use of and experimentation with sentiment analysis on various forms of textual data has become pervasive. The effect is especially pronounced in formulating securities trading strategies, for a plethora of reasons, including the relative ease of implementation and the abundance of academic research suggesting that automated sentiment analysis can be productively used in trading strategies. The source data for such analyzers ranges across a broad spectrum: social media feeds, micro-blogs, real-time news feeds, ex-post financial data, and so on. The technique underlying these analyzers involves supervised learning of sentiment classification, where the classifier is trained on an annotated source corpus and accuracy is measured by testing how well the classifier generalizes on unseen test data from the corpus. After training and validation of the fitted models, the classifiers are used to execute trading strategies, and the corresponding returns are compared with appropriate benchmark returns (e.g., the S&P 500 returns). In this paper, we introduce a novel technique of using price volatilities to empirically determine the sentiment in news data, instead of the traditional reverse approach. We also perform meta sentiment analysis by evaluating the efficacy of existing sentiment classifiers and the precise definition of sentiment in a securities trading context. We scrutinize the efficacy of using human-annotated sentiment classification and the tacit assumptions that introduce subjective bias into existing financial news sentiment classifiers.
    Date: 2017–12
    URL: http://d.repec.org/n?u=RePEc:arx:papers:1801.00091&r=big
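    Editor's illustration (a loose reading of the idea, not the paper's actual labelling rule): score each story by the realized volatility of the named ticker after versus before its timestamp, use that comparison as a machine-generated sentiment label, and then train an ordinary text classifier on those labels; the file layout, one-hour window, and threshold are all assumptions.
      import numpy as np
      import pandas as pd
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.linear_model import LogisticRegression

      prices = pd.read_csv("prices.csv", parse_dates=["time"])   # hypothetical: time, ticker, price
      news = pd.read_csv("news.csv", parse_dates=["time"])       # hypothetical: time, ticker, headline

      def realized_vol(ticker, start, end):
          p = prices[(prices.ticker == ticker) & prices.time.between(start, end)].price
          return np.log(p).diff().std()

      # Label a story "negative" if volatility jumps after it relative to before (illustrative rule).
      window = pd.Timedelta("60min")
      post = news.apply(lambda r: realized_vol(r.ticker, r.time, r.time + window), axis=1)
      pre = news.apply(lambda r: realized_vol(r.ticker, r.time - window, r.time), axis=1)
      news["label"] = np.where(post > 1.5 * pre, "negative", "positive")

      # Train a conventional text classifier on the volatility-derived labels.
      X = TfidfVectorizer(min_df=2).fit_transform(news.headline)
      clf = LogisticRegression(max_iter=1000).fit(X, news["label"])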
  4. By: Damien Échevin; Qing Li; Marc-André Morin
    Abstract: Hospital readmission is costly, and existing models are often poor or moderate at predicting readmission. We sought to develop and test a method that can be applied generally by hospitals. Such a tool can help clinicians identify patients who are more likely to be readmitted, either at early stages of the hospital stay or at hospital discharge. Relying on state-of-the-art machine learning algorithms, we predict the probability of 30-day readmission at hospital admission and at hospital discharge using administrative data on 1,633,099 hospital stays from Quebec between 1995 and 2012. We measure the performance of the predictions with the area under the receiver operating characteristic curve (AUC). Deep Learning produced excellent prediction of readmission province-wide, and Random Forest reached a very similar level. The AUC for these two algorithms reached above 78% at hospital admission and above 87% at hospital discharge, and diagnostic codes are among the most predictive variables. The ease of implementation of machine learning algorithms, together with objectively validated reliability, brings new possibilities for cost reduction in the health care system.
    Keywords: Machine learning; Logistic regression; Risk of re-hospitalisation; Healthcare costs
    JEL: I10 C52
    Date: 2017
    URL: http://d.repec.org/n?u=RePEc:lvl:criacr:1701&r=big
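    Editor's illustration (a hedged sketch of this kind of pipeline; the file and feature columns are placeholders, not the authors' Quebec administrative data): a random forest classifier evaluated by AUC.
      import pandas as pd
      from sklearn.ensemble import RandomForestClassifier
      from sklearn.metrics import roc_auc_score
      from sklearn.model_selection import train_test_split

      stays = pd.read_csv("hospital_stays.csv")                 # hypothetical administrative extract
      y = stays["readmit_30d"]                                  # 1 if readmitted within 30 days
      X = pd.get_dummies(stays[["age", "sex", "diag_code", "length_of_stay"]])  # hypothetical columns

      X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)
      clf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=0).fit(X_tr, y_tr)
      print("AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))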
  5. By: Martha Bailey; Connor Cole; Morgan Henderson; Catherine Massey
    Abstract: New large-scale data linking projects are revolutionizing empirical social science. Outside of selected samples and tightly restricted data enclaves, little is known about the quality of these “big data” or how the methods used to create them shape inferences. This paper evaluates the performance of commonly used automated record-linking algorithms in three high quality historical U.S. samples. Our findings show that (1) no method (including hand linking) consistently produces samples representative of the linkable population; (2) automated linking tends to produce very high rates of false matches, averaging around one third of links across datasets and methods; and (3) false links are systematically (though differently) related to baseline sample characteristics. A final exercise demonstrates the importance of these findings for inferences using linked data. For a common set of records, we show that algorithm assumptions can attenuate estimates of intergenerational income elasticities by almost 50 percent. Although differences in these findings across samples and methods caution against the generalizability of specific error rates, common patterns across multiple datasets offer broad lessons for improving current linking practice.
    JEL: J62 N0
    Date: 2017–11
    URL: http://d.repec.org/n?u=RePEc:nbr:nberwo:24019&r=big
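    Editor's illustration (a toy linking rule on hypothetical files, not the specific algorithms the paper audits): block candidate matches on reported birth year, score names with a string-similarity measure, and keep only unique close matches; the false links the paper documents arise from rules of exactly this kind.
      import pandas as pd
      from difflib import SequenceMatcher

      census_1900 = pd.read_csv("census_1900.csv")   # hypothetical: id, first_name, last_name, birth_year
      census_1910 = pd.read_csv("census_1910.csv")

      def similarity(a, b):
          return SequenceMatcher(None, a.lower(), b.lower()).ratio()

      links = []
      for _, rec in census_1900.iterrows():
          # Block: only consider records whose reported birth year is within one year.
          cands = census_1910[(census_1910["birth_year"] - rec["birth_year"]).abs() <= 1]
          if cands.empty:
              continue
          scores = cands.apply(lambda c: similarity(rec["first_name"] + " " + rec["last_name"],
                                                    c["first_name"] + " " + c["last_name"]), axis=1)
          # Accept only a unique close match; looser rules raise the false-match rate.
          if scores.max() > 0.9 and (scores > 0.9).sum() == 1:
              links.append((rec["id"], cands.loc[scores.idxmax(), "id"]))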
  6. By: Cook, Thomas R. (Federal Reserve Bank of Kansas City); Smalter Hall, Aaron (Federal Reserve Bank of Kansas City)
    Abstract: Economic policymaking relies upon accurate forecasts of economic conditions. Current methods for unconditional forecasting are dominated by inherently linear models that exhibit model dependence and have high data demands. We explore deep neural networks as an opportunity to improve upon forecast accuracy with limited data while remaining agnostic as to functional form. We focus on predicting civilian unemployment using models based on four different neural network architectures. Each of these models outperforms benchmark models at short time horizons. One model, based on an encoder-decoder architecture, outperforms benchmark models at every forecast horizon (up to four quarters).
    Keywords: Neural networks; Forecasting; Macroeconomic indicators
    JEL: C14 C45 C53
    Date: 2017–09–29
    URL: http://d.repec.org/n?u=RePEc:fip:fedkrw:rwp17-11&r=big
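    Editor's illustration (an architecture sketch only; the layer sizes, window lengths, and data file are assumptions, and the authors' models are richer): a sequence-to-sequence encoder-decoder that reads a window of past unemployment readings and emits a multi-step forecast.
      import numpy as np
      from tensorflow.keras.models import Sequential
      from tensorflow.keras.layers import LSTM, RepeatVector, TimeDistributed, Dense

      LOOKBACK, HORIZON = 12, 4                          # illustrative: 12 readings in, 4 steps out

      def make_windows(series, lookback=LOOKBACK, horizon=HORIZON):
          X, y = [], []
          for t in range(len(series) - lookback - horizon + 1):
              X.append(series[t:t + lookback])
              y.append(series[t + lookback:t + lookback + horizon])
          return np.array(X)[..., None], np.array(y)[..., None]

      unrate = np.loadtxt("unrate.csv", delimiter=",")   # hypothetical unemployment-rate series
      X, y = make_windows(unrate)

      model = Sequential([
          LSTM(32, input_shape=(LOOKBACK, 1)),           # encoder compresses the input window
          RepeatVector(HORIZON),                         # bridge to the decoder
          LSTM(32, return_sequences=True),               # decoder unrolls the forecast horizon
          TimeDistributed(Dense(1)),
      ])
      model.compile(optimizer="adam", loss="mse")
      model.fit(X, y, epochs=50, verbose=0)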
  7. By: Nyman, Rickard (University College London, Centre for the Study of Decision-Making Uncertainty); Kapadia, Sujit (Bank of England); Tuckett, David (University College London, Centre for the Study of Decision-Making Uncertainty); Gregory, David (Bank of England); Ormerod, Paul (University College London, Centre for the Study of Decision-Making Uncertainty); Smith, Robert (University College London, Centre for the Study of Decision-Making Uncertainty)
    Abstract: This paper applies algorithmic analysis to large amounts of financial market text-based data to assess how narratives and sentiment play a role in driving developments in the financial system. We find that changes in the emotional content in market narratives are highly correlated across data sources. They show clearly the formation (and subsequent collapse) of very high levels of sentiment — high excitement relative to anxiety — prior to the global financial crisis. Our metrics also have predictive power for other commonly used measures of sentiment and volatility and appear to influence economic and financial variables. And we develop a new methodology that attempts to capture the emergence of narrative topic consensus which gives an intuitive representation of increasing homogeneity of beliefs prior to the crisis. With increasing consensus around narratives high in excitement and lacking anxiety likely to be an important warning sign of impending financial system distress, the quantitative metrics we develop may complement other indicators and analysis in helping to gauge systemic risk.
    Keywords: Systemic risk; text mining; big data; sentiment; uncertainty; narratives; forecasting; early warning indicators
    JEL: C53 D83 E32 G01 G17
    Date: 2018–01–05
    URL: http://d.repec.org/n?u=RePEc:boe:boeewp:0704&r=big
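    Editor's illustration (a deliberately crude stand-in; the word lists are tiny placeholders, not the authors' lexicon): the relative sentiment idea reduces, at its simplest, to counting excitement words net of anxiety words in each document.
      import re

      EXCITEMENT = {"boom", "confident", "optimistic", "strong", "opportunity"}   # placeholder lexicon
      ANXIETY = {"fear", "worry", "crisis", "uncertain", "stress"}                # placeholder lexicon

      def relative_sentiment(text):
          tokens = re.findall(r"[a-z]+", text.lower())
          if not tokens:
              return 0.0
          excite = sum(t in EXCITEMENT for t in tokens)
          anxious = sum(t in ANXIETY for t in tokens)
          return (excite - anxious) / len(tokens)        # positive = excitement-dominated narrative

      print(relative_sentiment("Markets look strong and optimistic despite some worry."))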
  8. By: Hale, Galina (Federal Reserve Bank of San Francisco); Lopez, Jose A. (Federal Reserve Bank of San Francisco)
    Abstract: The need to monitor aggregate financial stability was made clear during the global financial crisis of 2008-2009, and, of course, the need to monitor individual financial firms from a microprudential standpoint remains. In this paper, we propose a procedure based on mixed-frequency models and network analysis to help address both of these policy concerns. We decompose firm-specific stock returns into two components: one that is explained by observed covariates (or fitted values), the other unexplained (or residuals). We construct networks based on the co-movement of these components. Analysis of these networks allows us to identify time periods of increased risk concentration in the banking sector and determine which firms pose high systemic risk. Our results illustrate the efficacy of such modeling techniques for monitoring and potentially enhancing national financial stability.
    JEL: C32 G21 G28
    Date: 2017–09–15
    URL: http://d.repec.org/n?u=RePEc:fip:fedfwp:2018-01&r=big
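    Editor's illustration (placeholder data, covariates, and threshold; not the authors' mixed-frequency specification): split each bank's returns into a fitted component and an unexplained residual, then link banks whose residuals co-move strongly and track the density of the resulting network.
      import pandas as pd
      import networkx as nx
      from sklearn.linear_model import LinearRegression

      returns = pd.read_csv("bank_returns.csv", index_col=0)   # hypothetical: dates x bank tickers
      factors = pd.read_csv("factors.csv", index_col=0)        # hypothetical observed covariates

      # Decompose each bank's returns into fitted values and residuals.
      residuals = pd.DataFrame(index=returns.index)
      for bank in returns.columns:
          fit = LinearRegression().fit(factors, returns[bank])
          residuals[bank] = returns[bank] - fit.predict(factors)

      # Link banks whose residuals co-move strongly; dense clusters flag risk concentration.
      corr = residuals.corr()
      G = nx.Graph()
      G.add_nodes_from(returns.columns)
      for i in returns.columns:
          for j in returns.columns:
              if i < j and corr.loc[i, j] > 0.5:               # threshold is illustrative
                  G.add_edge(i, j)
      print("network density:", nx.density(G))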
  9. By: Yun-Cheng Tsai; Jun-Hao Chen; Jun-Jie Wang
    Abstract: Deep learning is an effective approach to solving image recognition problems. People draw intuitive conclusions from trading charts; this study uses the characteristics of deep learning to train computers to imitate this kind of intuition in the context of trading charts. The three steps involved are as follows: 1. Before training, we pre-process the input data from quantitative data to images. 2. We use a convolutional neural network (CNN), a type of deep learning, to train our trading model. 3. We evaluate the model's performance in terms of classification accuracy. A trading model obtained with this approach can help devise trading strategies; the main application is designed to help clients automatically obtain personalized trading strategies.
    Date: 2018–01
    URL: http://d.repec.org/n?u=RePEc:arx:papers:1801.03018&r=big
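    Editor's illustration (the image encoding, network size, and labels are assumptions rather than the authors' design): paint each price window into a small chart-like image and train a CNN to classify the next move.
      import numpy as np
      from tensorflow.keras.models import Sequential
      from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

      def window_to_image(w, size=32):
          """Paint each price of a window as a pixel in a size x size grid (a crude line chart)."""
          scaled = (w - w.min()) / (w.max() - w.min() + 1e-9)
          cols = np.linspace(0, size - 1, len(w)).astype(int)
          rows = ((1 - scaled) * (size - 1)).astype(int)
          img = np.zeros((size, size))
          img[rows, cols] = 1.0
          return img

      prices = np.loadtxt("eurusd.csv", delimiter=",")          # hypothetical price series
      W = 32
      X = np.stack([window_to_image(prices[t:t + W]) for t in range(len(prices) - W)])[..., None]
      y = (prices[W:] > prices[W - 1:-1]).astype(int)           # label: next step up (1) or down (0)

      model = Sequential([
          Conv2D(16, 3, activation="relu", input_shape=(W, W, 1)),
          MaxPooling2D(),
          Conv2D(32, 3, activation="relu"),
          Flatten(),
          Dense(1, activation="sigmoid"),
      ])
      model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
      model.fit(X, y, epochs=10, verbose=0)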
  10. By: Ye Luo; Martin Spindler
    Abstract: Boosting algorithms are very popular in machine learning and have proven very useful for prediction and variable selection. Nevertheless, in many applications the researcher is interested in inference on treatment effects or policy variables in a high-dimensional setting. Empirical researchers are increasingly faced with rich datasets containing very many controls or instrumental variables, where variable selection is challenging. In this paper we give results for valid inference on a treatment effect after selecting from among very many control variables, and for estimation with instrumental variables when there are potentially very many instruments, when post- or orthogonal $L_2$-Boosting is used for the variable selection. This setting allows for valid inference on low-dimensional components in a regression estimated with $L_2$-Boosting. We give simulation results for the proposed methods and an empirical application in which we analyze the effectiveness of a pulmonary artery catheter.
    Date: 2017–12
    URL: http://d.repec.org/n?u=RePEc:arx:papers:1801.00364&r=big
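    Editor's illustration (a bare-bones componentwise $L_2$-Boosting routine on simulated data, not the authors' implementation): boosting partials the controls out of both the outcome and the treatment, and the treatment effect is then read off the residual-on-residual regression.
      import numpy as np

      def l2_boost(X, y, steps=200, nu=0.1):
          """Componentwise L2-boosting: repeatedly fit the best single column to the residual."""
          resid, coef = y - y.mean(), np.zeros(X.shape[1])
          for _ in range(steps):
              b = X.T @ resid / (X ** 2).sum(axis=0)                        # univariate LS fit per column
              j = np.argmax(np.abs(b) * np.sqrt((X ** 2).sum(axis=0)))      # column with largest fit gain
              coef[j] += nu * b[j]
              resid -= nu * b[j] * X[:, j]
          return coef, resid

      rng = np.random.default_rng(0)
      n, p = 200, 100
      X = rng.standard_normal((n, p))
      d = X[:, 0] + rng.standard_normal(n)                 # treatment depends on a control
      y = 0.5 * d + X[:, 0] - X[:, 1] + rng.standard_normal(n)

      _, y_res = l2_boost(X, y)                            # partial controls out of the outcome
      _, d_res = l2_boost(X, d)                            # partial controls out of the treatment
      theta = np.sum(d_res * y_res) / np.sum(d_res ** 2)   # residual-on-residual estimate
      print("treatment effect estimate:", theta)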
  11. By: Christian Gouriéroux (CREST; University of Toronto); Alain Monfort (CREST); Eric Renault (Brown university)
    Abstract: The development of the literature on pseudo maximum likelihood (PML) estimators would not have been so efficient without the modern proof of consistency of extremum estimators introduced at the end of the sixties by E. Malinvaud and R. Jennrich. We discuss this proof and place it in historical perspective. We also provide a survey of the literature on consistent PML estimators, emphasizing the role of the white noise assumptions on the set of pseudo distributions leading to consistent estimators: the stronger these assumptions, the larger the set of consistent PML estimators. Finally, we illustrate the importance of these PML approaches in a big data environment.
    Keywords: Pseudo-Likelihood, Composite Pseudo-Likelihood, Consistency, Big Data, ARCH Model, Normalized Data, Lie Group
    URL: http://d.repec.org/n?u=RePEc:crs:wpaper:2017-10&r=big
  12. By: Michael Jansson; Demian Pouzo
    Abstract: We present a general framework for studying regularized estimators; i.e., estimation problems wherein "plug-in" type estimators are either ill-defined or ill-behaved. We derive primitive conditions that imply consistency and asymptotic linear representation for regularized estimators, allowing for slower than $\sqrt{n}$ estimators as well as infinite dimensional parameters. We also provide data-driven methods for choosing tuning parameters that, under some conditions, achieve the aforementioned results. We illustrate the scope of our approach by studying a wide range of applications, revisiting known results and deriving new ones.
    Date: 2017–12
    URL: http://d.repec.org/n?u=RePEc:arx:papers:1712.07248&r=big
  13. By: Jérôme Baray (IRG - Institut de Recherche en Gestion - UPEM - Université Paris-Est Marne-la-Vallée - UPEC UP12 - Université Paris-Est Créteil Val-de-Marne - Paris 12); Albert Da Silva; Jean-Marc Leblanc
    Abstract: Cliodynamics is a fairly recent research field that treats history as an object of scientific study. Thanks to its transdisciplinary nature, cliodynamics tries to explain historical dynamical processes, such as the rise or collapse of empires and civilizations, economic cycles, population booms, and fashions, through mathematical modeling, data mining, econometrics, and cultural sociology. "Big data" aggregating historical, archaeological, or economic information is the material that feeds these quantitative models. Cliodynamics also includes empirical analysis that validates the assumptions and predictions of dynamic models against historical data. It is part of the cliometrics approach, or "new economic history", which studies history through econometrics. Objectives: On the one hand, we designed a robust lexical analysis method able to deal with a very large series of dated corpora whose content evolves over time (big data), with the challenge of identifying societal evolutions and major historical periods from a cliodynamics perspective. On the other hand, the lexical analysis examined the lessons to be learned from the Google Books Ngram database, which details the number of annual word occurrences in the scanned publications available in the Google Books search engine. It is assumed that this database covers about 20% of all books ever published in the major languages. We focused our study on English-language books published in the United States and Great Britain, with the objective of tracking word frequencies from 1860 to 2008. Method: The first step was to build a dictionary of the most commonly used English words, disregarding ambiguous terms, prepositions, articles, and pronouns. This dictionary collects 1,592 words covering many aspects of social and cultural life, with terms related to politics, religion, arts and sciences, industry, objects, family, and sentiments. In a second step, the percentage representation of each dictionary word was computed for each year after loading the very large Google Books Ngram (1-gram) database into PostgreSQL. Some words, such as "king" or "queen", are very well represented in the 19th century, reflecting the reign and power of royalty in Europe, but their use declined in the 20th century; word frequencies in books evolve constantly over time. The third step was to perform a centered and standardized principal component analysis (PCA) on the table describing the representation of words (in %) by year from 1860 to 2008. A clustering of years is then carried out using a neural network (a Kohonen self-organizing map). The results show 8 distinct historical periods organized along 3 major tendencies in discourse: humanist versus scientific; chaos versus organization; individualist versus collectivist.
    Keywords: cliodynamics, lexical analysis, Google Ngram, artificial intelligence, big data
    Date: 2017–11–24
    URL: http://d.repec.org/n?u=RePEc:hal:journl:hal-01648487&r=big
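    Editor's illustration (file names are placeholders, and k-means stands in for the paper's Kohonen self-organizing map): yearly word shares from a 1-gram table, a centered and standardized PCA, then a clustering of years into periods.
      import pandas as pd
      from sklearn.preprocessing import StandardScaler
      from sklearn.decomposition import PCA
      from sklearn.cluster import KMeans

      ngrams = pd.read_csv("ngram_1gram_extract.csv")          # hypothetical: year, word, count
      dictionary = set(pd.read_csv("dictionary.csv")["word"])  # hypothetical 1,592-word lexicon

      sub = ngrams[ngrams["word"].isin(dictionary)]
      table = sub.pivot_table(index="year", columns="word", values="count", aggfunc="sum").fillna(0)
      shares = table.div(table.sum(axis=1), axis=0)            # each word's share of the dictionary, by year

      # Centered and standardized PCA, then group the years into candidate historical periods.
      components = PCA(n_components=10).fit_transform(StandardScaler().fit_transform(shares))
      periods = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(components)
      print(pd.Series(periods, index=shares.index))            # which period each year falls into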

This nep-big issue is ©2018 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at http://nep.repec.org. For comments please write to the director of NEP, Marco Novarese at <director@nep.repec.org>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.