nep-big New Economics Papers
on Big Data
Issue of 2017‒11‒12
thirteen papers chosen by
Tom Coupé
University of Canterbury

  1. Big Data: Potential, Challenges and Statistical Implications By Cornelia Hammer; Diane C Kostroch; Gabriel Quiros
  2. Text Mining-based Economic Activity Estimates By Ksenia Yakovleva
  3. Sending firm messages: text mining letters from PRA supervisors to banks and building societies they regulate By Bholat, David; Brookes, James; Cai, Chris; Grundy, Katy; Lund, Jakob
  4. What's the Story? A New Perspective on the Value of Economic Forecasts By Steven A. Sharpe; Nitish R. Sinha; Christopher A. Hollrah
  5. Big Data Measures of Well-Being: Evidence from a Google Well-Being Index in the United States By Algan, Yann; Beasley, Elizabeth; Guyot, Florian; Higa, Kazuhito; Murtin, Fabrice; Senik, Claudia
  6. Colonial Legacies: Shaping African Cities By Neeraj Baruah; J. Vernon Henderson; Cong Peng
  7. Nighttime Lights as a Proxy for Human Development at the Local Level By Anna Bruederle; Roland Hodler
  8. Healthy Immigrant Effect or Over-Medicalization of Pregnancy? Evidence from Birth Certificates By Bertoli, P.; Grembi, V.; Kazakis, P.;
  9. Forecasting the Success Rate of Reward Based Crowdfunding Projects By Elenchev, Ivelin; Vasilev, Aleksandar
  10. Generalized Random Forests By Athey, Susan; Tibshirani, Julie; Wager, Stefan
  11. Estimation and Inference of Heterogeneous Treatment Effects Using Random Forests By Wager, Stefan; Athey, Susan
  12. Global Optimization issues in Supervised Learning. An overview By Laura Palagi
  13. Forecasting the Success Rate of Reward Based Crowdfunding Projects By Ivelin Elenchev; Aleksandar Vasilev

  1. By: Cornelia Hammer; Diane C Kostroch; Gabriel Quiros
    Abstract: Big data are part of a paradigm shift that is significantly transforming statistical agencies, processes, and data analysis. While administrative and satellite data are already well established, the statistical community is now experimenting with structured and unstructured human-sourced, process-mediated, and machine-generated big data. The proposed Staff Discussion Note (SDN) sets out a typology of big data for statistics and highlights that opportunities to exploit big data for official statistics will vary across countries and statistical domains. To illustrate this variation, examples from a diverse set of countries are presented. To provide a balanced assessment of big data, the SDN also discusses the key challenges that come with proprietary data from the private sector with regard to accessibility, representativeness, and sustainability. It concludes by discussing the implications for the statistical community going forward.
    Keywords: Surveillance, Big Data, Macroeconomic and Financial Statistics, Official Statistics, Data Quality, General, Institutions and the Macroeconomy, Introductory Material
    Date: 2017–09–13
    URL: http://d.repec.org/n?u=RePEc:imf:imfsdn:17/06&r=big
  2. By: Ksenia Yakovleva (Bank of Russia, Russian Federation)
    Abstract: This paper outlines a methodology for calculating a high-frequency indicator of economic activity in Russia. News articles taken from Internet resources are used as the data source. The articles are analysed using text mining and machine learning methods, which, although developed relatively recently, have quickly found wide application in scientific research, including economic studies. This is because news is not only a key source of information but also a way to gauge the sentiment of journalists and survey respondents about the current situation and to convert it into quantitative data.
    Keywords: economic activity estimates, text mining, machine learning.
    JEL: E37
    Date: 2017–10
    URL: http://d.repec.org/n?u=RePEc:bkr:wpaper:wps25&r=big
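A high-frequency text-based indicator of this kind boils down to scoring each article for sentiment and averaging over a time window. The sketch below illustrates the general idea with a toy bag-of-words scorer in Python; the word lists and scoring rule are illustrative assumptions, not the Bank of Russia's dictionaries or its actual machine-learning methodology.

```python
# Toy bag-of-words sentiment scoring for news text -- an illustrative
# sketch only, NOT the methodology of the paper above. The word lists
# are hypothetical stand-ins for a real sentiment lexicon.
POSITIVE = {"growth", "expansion", "improved", "rising", "record"}
NEGATIVE = {"decline", "recession", "falling", "losses", "slowdown"}

def sentiment_score(article: str) -> float:
    """Return (positive - negative) / total tokens; 0.0 for empty text."""
    tokens = article.lower().split()
    if not tokens:
        return 0.0
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return (pos - neg) / len(tokens)

def activity_index(articles: list[str]) -> float:
    """Average sentiment over a window of articles -- a crude activity proxy."""
    if not articles:
        return 0.0
    return sum(sentiment_score(a) for a in articles) / len(articles)
```

In practice a lexicon approach like this is only a baseline; the paper combines text mining with machine learning models trained against observed activity data.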
  3. By: Bholat, David (Bank of England); Brookes, James (Bank of England); Cai, Chris (Bank of England); Grundy, Katy (Bank of England); Lund, Jakob (Bank of England)
    Abstract: Our paper analyses confidential letters sent from the Bank of England’s Prudential Regulation Authority (PRA) to banks and building societies it supervises. These letters are a ‘report card’ written to firms annually, and are arguably the most important, regularly recurring written communication sent from the PRA to firms it supervises. Using a mix of methods, including a machine learning algorithm called random forests, we explore whether the letters vary depending on the riskiness of the firm to whom the PRA is writing. We find that they do. We also look across the letters as a whole to draw out key topical trends and confirm that topics important on the post-crisis regulatory agenda such as liquidity and resolution appear frequently. And we look at how PRA letters differ from the letters written by the PRA’s predecessor, the Financial Services Authority. We find evidence that PRA letters are different, with a greater abundance of forward-looking language and directiveness, reflecting the shift in supervisory approach that has occurred in the United Kingdom following the financial crisis of 2007–09.
    Keywords: Bank of England Prudential Regulation Authority; banking supervision; text mining; machine learning; random forests; Financial Services Authority; central bank communications
    JEL: C80 E58 G28
    Date: 2017–10–27
    URL: http://d.repec.org/n?u=RePEc:boe:boeewp:0688&r=big
  4. By: Steven A. Sharpe; Nitish R. Sinha; Christopher A. Hollrah
    Abstract: We apply textual analysis tools to measure the degree of optimism versus pessimism of the text that describes Federal Reserve Board forecasts published in the Greenbook. We then examine whether this measure of sentiment, or Greenbook text "Tonality", has incremental power for predicting the economy, specifically unemployment, GDP growth, and inflation up to four quarters ahead; we also test whether Tonality helps predict monetary policy and stock returns. Tonality is found to have significant and substantive directional predictive power for GDP growth and the change in unemployment over the subsequent four-quarter horizon, particularly since 1990. Higher (more optimistic) Tonality presages higher-than-forecast GDP growth and lower unemployment. Higher Tonality is also found to help predict tighter monetary policy up to four quarters ahead. Finally, we find that Tonality has substantial positive and significant power for predicting 3-month-ahead and 6-month-ahead stock market returns.
    Keywords: Economic Forecasts ; Monetary policy ; Text Analysis
    JEL: C53 E17 E27 E37 E52
    Date: 2017–11–03
    URL: http://d.repec.org/n?u=RePEc:fip:fedgfe:2017-107&r=big
  5. By: Algan, Yann; Beasley, Elizabeth; Guyot, Florian; Higa, Kazuhito; Murtin, Fabrice; Senik, Claudia
    Abstract: We build an indicator of individual well-being in the United States based on Google Trends. The indicator is a combination of keyword groups that are endogenously identified to fit with weekly time series of subjective well-being measures collected by Gallup Analytics. We find that keywords associated with job search, financial security, family life and leisure are the strongest predictors of the variations in subjective well-being. The model successfully predicts the out-of-sample evolution of most subjective well-being measures at a one-year horizon.
    Keywords: Subjective Well-Being; Big Data; Bayesian Statistics
    Date: 2016–06
    URL: http://d.repec.org/n?u=RePEc:cpm:docweb:1605&r=big
  6. By: Neeraj Baruah; J. Vernon Henderson; Cong Peng
    Abstract: Differential institutions imposed during colonial rule continue to affect the spatial structure of, and urban interactions in, African cities. Based on a sample of 318 cities across 28 countries, using satellite data on built cover over time, we find that Anglophone-origin cities sprawl compared to Francophone ones. Anglophone cities have less intense land use and a more irregular layout in the older colonial portions of cities, and more leapfrog development at the extensive margin. The results are robust to a border experiment, many robustness tests, alternative measures of sprawl, and sub-samples. Why would colonial origins matter? The British operated under indirect rule and a dual mandate within cities, allowing colonial and native sections to develop without an overall plan or coordination. In contrast, integrated city planning and land allocation mechanisms were a feature of French colonial rule, which was inclined to direct rule. The results also have public policy relevance. From the Demographic and Health Survey, similar households located in areas of the city with more leapfrog development have poorer connections to piped water, electricity, and landlines, presumably because of the higher costs of providing infrastructure under urban sprawl.
    Keywords: colonialism, persistence, Africa, sprawl, urban form, urban planning, leapfrog
    JEL: H7 N97 O1 O43 P48 R5
    Date: 2017–11
    URL: http://d.repec.org/n?u=RePEc:cep:sercdp:0226&r=big
  7. By: Anna Bruederle; Roland Hodler
    Abstract: Nighttime lights are increasingly used by social scientists as a proxy for economic activity and economic development in subnational spatial units. However, so far, our understanding of what nighttime lights capture is limited. We construct local indicators of household wealth, education and health from geo-coded Demographic and Health Surveys (DHS) for 29 African countries. We show that nighttime lights are positively associated with these indicators across DHS cluster locations as well as across grid cells of roughly 50 x 50 km. We conclude that nighttime lights are a good proxy for human development at the local level.
    Keywords: nighttime lights, local development, Africa
    JEL: I15 I25 I32 O15 O55
    Date: 2017
    URL: http://d.repec.org/n?u=RePEc:ces:ceswps:_6555&r=big
  8. By: Bertoli, P.; Grembi, V.; Kazakis, P.;
    Abstract: We investigate the consumption of health care by immigrants using newborn- and mother-level data from birth certificates. We use a predictive algorithm based on machine learning to identify the observables affecting birth health outcomes and the use of prenatal care. Using these observables, our empirical analysis pinpoints an advantage of immigrants over natives in newborns’ birth weight, and a lower use of prenatal care and of C-sections by immigrant mothers. To disentangle the healthy immigrant effect explanation of our results from an over-medicalization of pregnancy explanation, we use an IV approach. Our results support the over-medicalization of pregnancy hypothesis.
    Keywords: Healthy Immigrant Effect; Deliveries; Prenatal Care; Consumption of Health care;
    JEL: I12 I14 J15
    Date: 2017–11
    URL: http://d.repec.org/n?u=RePEc:yor:hectdg:17/26&r=big
  9. By: Elenchev, Ivelin; Vasilev, Aleksandar
    Abstract: The present paper develops three models that help predict the success rate and attainable investment levels of online crowdfunding ventures. This is done by applying standard economic theory and machine learning techniques from computer science to the novel sector of online crowd-based micro-financing. In contrast with previous research in the area, this paper analyzes transaction-level data in addition to information about completed crowdfunding projects. This provides a unique perspective on the ways crowdfinance ventures develop. The models reach an average of 83% accuracy in predicting the outcome of a crowdfunding campaign at any point throughout its duration. These findings show that a number of product- and project-specific parameters are indicative of the success of a venture. Subsequently, the paper provides guidance to capital seekers and investors on the basis of these criteria, and allows participants in the crowdfunding marketplace to make more rational decisions.
    Keywords: microfinance, entrepreneurial finance, crowdfunding
    JEL: C0
    Date: 2017
    URL: http://d.repec.org/n?u=RePEc:zbw:esprep:170681&r=big
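The prediction task the paper describes, labeling a campaign as successful or not from its observable characteristics, can be sketched as a standard supervised-classification exercise. The toy example below trains a random-forest classifier on synthetic data; the features (goal, days elapsed, backers, pledged share) and the data-generating rule are hypothetical stand-ins, not the authors' models, features, or their 83% accuracy result.

```python
# Predicting campaign success from hypothetical project features -- an
# illustrative random-forest classifier on synthetic data, NOT the
# authors' models or their transaction-level dataset.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 3000
# Hypothetical features: funding goal, days elapsed, backers so far,
# and share of the goal already pledged (assumed average pledge of 150).
goal = rng.uniform(1_000, 20_000, n)
days_elapsed = rng.uniform(0, 30, n)
backers = rng.poisson(40, n)
pledged_share = np.clip(backers * 150 / goal, 0, 2)
X = np.column_stack([goal, days_elapsed, backers, pledged_share])
# Synthetic rule: campaigns with momentum (high pledged share) succeed.
success = (pledged_share + 0.1 * rng.normal(size=n) > 0.8).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, success, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
accuracy = clf.score(X_te, y_te)  # held-out classification accuracy
```

Because the synthetic labels are almost a deterministic function of the features, the classifier scores far above chance here; real transaction-level data is, of course, much noisier.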
  10. By: Athey, Susan (Stanford University); Tibshirani, Julie (Stanford University); Wager, Stefan (Stanford University)
    Abstract: We propose generalized random forests, a method for non-parametric statistical estimation based on random forests (Breiman, 2001) that can be used to fit any quantity of interest identified as the solution to a set of local moment equations. Following the literature on local maximum likelihood estimation, our method operates at a particular point in covariate space by considering a weighted set of nearby training examples; however, instead of using classical kernel weighting functions that are prone to a strong curse of dimensionality, we use an adaptive weighting function derived from a forest designed to express heterogeneity in the specified quantity of interest. We propose a flexible, computationally efficient algorithm for growing generalized random forests, develop a large sample theory for our method showing that our estimates are consistent and asymptotically Gaussian, and provide an estimator for their asymptotic variance that enables valid confidence intervals. We use our approach to develop new methods for three statistical tasks: non-parametric quantile regression, conditional average partial effect estimation, and heterogeneous treatment effect estimation via instrumental variables. A software implementation, grf for R and C++, is available from CRAN.
    Date: 2017–07
    URL: http://d.repec.org/n?u=RePEc:ecl:stabus:3575&r=big
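The adaptive-weighting idea at the heart of generalized random forests can be illustrated compactly: a fitted forest induces, for any test point, a weight over each training example (how often the two share a leaf, normalized by leaf size), and the estimate solves the moment condition under those weights. The sketch below shows the simplest case, where the quantity of interest is a conditional mean; it uses scikit-learn's ordinary regression forest on synthetic data rather than the authors' grf package (which is for R and C++).

```python
# Forest-based adaptive weighting -- a toy illustration of the kernel
# idea behind generalized random forests, not the grf implementation.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 2))
y = X[:, 0] ** 2 + 0.1 * rng.normal(size=500)

forest = RandomForestRegressor(n_estimators=100, min_samples_leaf=10,
                               random_state=0).fit(X, y)

def forest_weights(x_test, X_train, forest):
    """Weight each training point by how often it shares a leaf with
    x_test, normalized per tree by leaf size, averaged over trees."""
    train_leaves = forest.apply(X_train)                  # (n_train, n_trees)
    test_leaves = forest.apply(x_test.reshape(1, -1))[0]  # (n_trees,)
    w = np.zeros(len(X_train))
    for t in range(train_leaves.shape[1]):
        in_leaf = train_leaves[:, t] == test_leaves[t]
        w[in_leaf] += 1.0 / in_leaf.sum()
    return w / train_leaves.shape[1]

x0 = np.array([0.5, 0.0])
w = forest_weights(x0, X, forest)
# Weighted mean of y under the forest weights -- the GRF-style estimate
# for the simple case where the target is the conditional mean.
estimate = float(np.dot(w, y))
```

Replacing the weighted mean with a weighted quantile, partial effect, or instrumental-variables moment equation yields the other estimators the paper develops.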
  11. By: Wager, Stefan (Stanford University); Athey, Susan (Stanford University)
    Abstract: Many scientific and engineering challenges--ranging from personalized medicine to customized marketing recommendations--require an understanding of treatment effect heterogeneity. In this paper, we develop a non-parametric causal forest for estimating heterogeneous treatment effects that extends Breiman's widely used random forest algorithm. In the potential outcomes framework with unconfoundedness, we show that causal forests are pointwise consistent for the true treatment effect, and have an asymptotically Gaussian and centered sampling distribution. We also discuss a practical method for constructing asymptotic confidence intervals for the true treatment effect that are centered at the causal forest estimates. Our theoretical results rely on a generic Gaussian theory for a large family of random forest algorithms. To our knowledge, this is the first set of results that allows any type of random forest, including classification and regression forests, to be used for provably valid statistical inference. In experiments, we find causal forests to be substantially more powerful than classical methods based on nearest-neighbor matching, especially in the presence of irrelevant covariates.
    Date: 2017–07
    URL: http://d.repec.org/n?u=RePEc:ecl:stabus:3576&r=big
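As a point of comparison for the causal forest, a common simple baseline under unconfoundedness is the two-model ("T-learner") approach: fit one regression forest on treated units and one on controls, and take the difference of their predictions as the treatment-effect estimate. The sketch below implements that baseline on synthetic data with scikit-learn; it is not the honest causal forest developed in the paper and lacks its inferential guarantees.

```python
# Two-model (T-learner) baseline for heterogeneous treatment effects
# using ordinary random forests -- a simplified stand-in for comparison,
# NOT the honest causal forest the paper develops.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
n = 2000
X = rng.uniform(-1, 1, size=(n, 3))
W = rng.integers(0, 2, size=n)          # random treatment assignment
tau = np.maximum(X[:, 0], 0.0)          # true heterogeneous effect
y = X[:, 1] + W * tau + 0.1 * rng.normal(size=n)

# Separate outcome models for treated and control units.
m1 = RandomForestRegressor(n_estimators=200, min_samples_leaf=20,
                           random_state=0).fit(X[W == 1], y[W == 1])
m0 = RandomForestRegressor(n_estimators=200, min_samples_leaf=20,
                           random_state=0).fit(X[W == 0], y[W == 0])

def cate(x):
    """Estimated conditional average treatment effect at covariates x."""
    x = np.atleast_2d(x)
    return m1.predict(x) - m0.predict(x)

effect_hi = float(cate([0.8, 0.0, 0.0])[0])   # true effect is 0.8 here
effect_lo = float(cate([-0.8, 0.0, 0.0])[0])  # true effect is 0.0 here
```

The paper's experiments suggest forest-based methods like this recover effect heterogeneity better than nearest-neighbor matching, and the causal forest additionally supplies valid confidence intervals.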
  12. By: Laura Palagi (Department of Computer, Control and Management Engineering Antonio Ruberti (DIAG), University of Rome La Sapienza, Rome, Italy)
    Abstract: The paper presents an overview of global optimization issues in methods for Supervised Learning (SL). We focus on Feedforward Neural Networks (FNNs) with the aim of reviewing global methods specifically devised for the class of continuous unconstrained optimization problems arising both in Multi-Layer Perceptron/Deep Networks and in Radial Basis Networks. We first recall the learning optimization paradigm for FNNs and briefly discuss global schemes for the joint choice of the network topology and the network parameters. The main part of the paper focuses on the core subproblem: the unconstrained regularized weight optimization problem. We review some recent results on the existence of local, non-global solutions of the unconstrained nonlinear problem and on the role of determining a global solution in a Machine Learning paradigm. Local algorithms that are widely used to solve the continuous unconstrained problems are addressed, with a focus on possible improvements that exploit the global properties. Hybrid global methods specifically devised for SL optimization problems, which embed local algorithms, are discussed at the end.
    Keywords: Supervised Learning ; Feedforward Neural Networks ; Global Optimization ; Weights Optimization ; Hybrid algorithms
    Date: 2017
    URL: http://d.repec.org/n?u=RePEc:aeg:report:2017-11&r=big
  13. By: Ivelin Elenchev; Aleksandar Vasilev (Centre for Economic Theories and Policies, Sofia University St. Kliment Ohridski)
    Abstract: The present paper develops three models that help predict the success rate and attainable investment levels of online crowdfunding ventures. This is done by applying standard economic theory and machine learning techniques from computer science to the novel sector of online crowd-based micro-financing. In contrast with previous research in the area, this paper analyzes transaction-level data in addition to information about completed crowdfunding projects. This provides a unique perspective on the ways crowdfinance ventures develop. The models reach an average of 83% accuracy in predicting the outcome of a crowdfunding campaign at any point throughout its duration. These findings show that a number of product- and project-specific parameters are indicative of the success of the venture. Subsequently, the paper provides guidance to capital seekers and investors on the basis of these criteria, and allows participants in the crowdfunding marketplace to make more rational decisions.
    Keywords: microfinance, entrepreneurial finance, crowdfunding
    JEL: M20 G24
    Date: 2017–11
    URL: http://d.repec.org/n?u=RePEc:sko:wpaper:bep-2017-09&r=big

This nep-big issue is ©2017 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at http://nep.repec.org. For comments please write to the director of NEP, Marco Novarese at <director@nep.repec.org>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.