nep-big New Economics Papers
on Big Data
Issue of 2017‒10‒08
ten papers chosen by
Tom Coupé
University of Canterbury

  1. Double/Debiased Machine Learning for Treatment and Causal Parameters By Victor Chernozhukov; Denis Chetverikov; Mert Demirer; Esther Duflo; Christian Hansen; Whitney Newey; James Robins
  2. Heterogeneous Employment Effects of Job Search Programmes: A Machine Learning Approach By Michael Knaus; Michael Lechner; Anthony Strittmatter
  3. The (Unfulfilled) Potential of Data Marketplaces By Koutroumpis, Pantelis; Leiponen, Aija; Thomas, Llewellyn D W
  4. Noncognitive Skills and Labor Market Outcomes: A Machine Learning Approach By Mareckova, Jana; Pohlmeier, Winfried
  5. L2-Boosting for Economic Applications By Luo, Ye; Spindler, Martin
  6. Locally Robust Semiparametric Estimation By Victor Chernozhukov; Juan Carlos Escanciano; Hidehiko Ichimura; Whitney K. Newey
  7. A logit model for the estimation of the educational level influence on unemployment in Romania By Oancea, Bogdan; Pospisil, Richard; Dragoescu, Raluca
  8. Using debit card payments data for nowcasting Dutch household consumption By Roy Verbaan; Wilko Bolt; Carin van der Cruijsen
  9. Clicking towards Mozambique's New Jobs: A research note By Pedro S. Martins
  10. Planning Ahead for Better Neighborhoods: Long Run Evidence from Tanzania By Neeraj Baruah; Amanda Dahlstrand-Rudin; Guy Michaels; Dzhamilya Nigmatulina; Ferdinand Rauch; Tanner Regan

  1. By: Victor Chernozhukov; Denis Chetverikov; Mert Demirer; Esther Duflo; Christian Hansen; Whitney Newey; James Robins
    Abstract: Most modern supervised statistical/machine learning (ML) methods are explicitly designed to solve prediction problems very well. Achieving this goal does not imply that these methods automatically deliver good estimators of causal parameters. Examples of such parameters include individual regression coefficients, average treatment effects, average lifts, and demand or supply elasticities. In fact, estimates of such causal parameters obtained by naively plugging ML estimators into estimating equations for such parameters can behave very poorly due to regularization bias. Fortunately, this regularization bias can be removed by solving auxiliary prediction problems via ML tools. Specifically, we can form an orthogonal score for the target low-dimensional parameter by combining auxiliary and main ML predictions. The score is then used to build a debiased estimator of the target parameter which typically will converge at the fastest possible 1/√n rate, be approximately unbiased and normal, and allow construction of valid confidence intervals for these parameters of interest. The resulting method could thus be called a "double ML" method because it relies on estimating primary and auxiliary predictive models. To avoid overfitting, our construction also makes use of K-fold sample splitting, which we call cross-fitting. This allows us to use a very broad set of ML predictive methods in solving the auxiliary and main prediction problems, such as random forests, lasso, ridge, deep neural nets, boosted trees, and various hybrids and aggregators of these methods.
    Date: 2016–07
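    The cross-fitting recipe the abstract describes — fit the auxiliary prediction problems on held-out folds, then regress residual on residual — can be sketched as follows for the partially linear model y = θ·D + g(X) + ε. This is an illustrative sketch on made-up data, with a closed-form ridge learner standing in for the flexible ML methods the paper allows:

    ```python
    import numpy as np

    def ridge_fit(X, y, alpha=1e-3):
        """Closed-form ridge coefficients (a stand-in for any ML learner)."""
        p = X.shape[1]
        return np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)

    def double_ml_plm(X, D, y, n_folds=2, seed=0):
        """Cross-fitted estimate of theta in y = theta*D + g(X) + eps.

        Nuisance functions E[D|X] and E[y|X] are fit on the other folds,
        then theta is a residual-on-residual regression (orthogonal score).
        """
        n = len(y)
        rng = np.random.default_rng(seed)
        folds = np.array_split(rng.permutation(n), n_folds)
        v_res, y_res = np.empty(n), np.empty(n)
        for k, test in enumerate(folds):
            train = np.concatenate([f for i, f in enumerate(folds) if i != k])
            m_hat = X[test] @ ridge_fit(X[train], D[train])   # predict D from X
            l_hat = X[test] @ ridge_fit(X[train], y[train])   # predict y from X
            v_res[test] = D[test] - m_hat
            y_res[test] = y[test] - l_hat
        return v_res @ y_res / (v_res @ v_res)

    # synthetic check with known theta = 2 (linear nuisances for simplicity)
    rng = np.random.default_rng(42)
    n, p = 2000, 5
    X = rng.standard_normal((n, p))
    D = X @ rng.uniform(0.5, 1.0, p) + rng.standard_normal(n)
    y = 2.0 * D + X @ rng.uniform(-1.0, 1.0, p) + rng.standard_normal(n)
    theta_hat = double_ml_plm(X, D, y)
    ```

    With the nuisances estimated out-of-fold, the plug-in bias of the ridge learners does not contaminate the estimate of θ.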
  2. By: Michael Knaus; Michael Lechner; Anthony Strittmatter
    Abstract: We systematically investigate the effect heterogeneity of job search programmes for unemployed workers. To investigate possibly heterogeneous employment effects, we combine non-experimental causal empirical models with Lasso-type estimators. The empirical analyses are based on rich administrative data from Swiss social security records. We find considerable heterogeneities only during the first six months after the start of training. Consistent with previous results in the literature, unemployed persons with fewer employment opportunities profit more from participating in these programmes. Furthermore, we also document heterogeneous employment effects by residence status. Finally, we show the potential of easy-to-implement programme participation rules for improving the average employment effects of these active labour market programmes.
    Date: 2017–09
  3. By: Koutroumpis, Pantelis; Leiponen, Aija; Thomas, Llewellyn D W
    Abstract: Although industrial datasets are abundant and growing daily, they are not being shared or traded openly and transparently on a large scale. We investigate the nature of data trading with a conceptual market design approach and demonstrate the importance of provenance for overcoming protection and quality concerns. We consider the requirements for data marketplaces, compare existing data marketplaces against standard market design metrics, and outline both centralized and decentralized multilateral designs. We assess the benefits and potential operational features of emerging multilateral designs. We conclude with future research directions.
    Keywords: Data marketplaces, data trading, market design
    JEL: D82 K12 L14 O34
    Date: 2017–09–29
  4. By: Mareckova, Jana; Pohlmeier, Winfried
    Abstract: We study the importance of noncognitive skills in explaining differences in the labor market performance of individuals by means of machine learning techniques. Unlike previous empirical approaches centering on the within-sample explanatory power of noncognitive skills, our approach focuses on the out-of-sample forecasting and classification qualities of noncognitive skills. Moreover, we show that machine learning techniques can cope with the challenge of selecting the most relevant covariates from big data with a very large number of covariates on personality traits. This enables us to construct new personality indices with larger predictive power. In our empirical application we study the role of noncognitive skills for individual earnings and unemployment based on the British Cohort Study (BCS). The longitudinal character of the BCS enables us to analyze the predictive power of the early childhood environment and of early cognitive and noncognitive skills on adult labor market outcomes. The results of the analysis show a potential long-run influence of early childhood variables on earnings and unemployment.
    Keywords: personality traits,machine learning
    JEL: J24 J64 C38
    Date: 2017
  5. By: Luo, Ye; Spindler, Martin
    Abstract: In recent years, more and more high-dimensional data sets, where the number of parameters p is high compared to the number of observations n or even larger, have become available to applied researchers. Boosting algorithms represent one of the major advances in machine learning and statistics in recent years and are suitable for the analysis of such data sets. While Lasso has been applied very successfully to high-dimensional data sets in economics, boosting has been underutilized in this field, although it has proven very powerful in fields like biostatistics and pattern recognition. We attribute this to missing theoretical results for boosting. The goal of this paper is to fill this gap and show that boosting is a competitive method for inference on a treatment effect or for instrumental variable (IV) estimation in a high-dimensional setting. First, we present the L2-Boosting algorithm with componentwise least squares and variants tailored for regression problems, which are the workhorse of most econometric problems. Then we show how L2-Boosting can be used for estimation of treatment effects and for IV estimation. We highlight the methods and illustrate them with simulations and empirical examples. For further results and technical details we refer to (?) and (?) and to the online supplement of the paper.
    JEL: C21 C26
    Date: 2017
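    The core algorithm the abstract names — L2-Boosting with componentwise least squares — is simple to state: repeatedly fit the single covariate that best explains the current residual and take a small step toward that fit. A minimal sketch on made-up sparse data (learning rate and step count are illustrative choices, not the paper's):

    ```python
    import numpy as np

    def l2_boost(X, y, n_steps=200, nu=0.1):
        """L2-Boosting with componentwise least squares.

        Each step: compute the univariate LS slope for every column, pick the
        column whose fit most reduces the residual sum of squares, and update
        its coefficient by a shrunken step of size nu.
        """
        n, p = X.shape
        coef = np.zeros(p)
        intercept = y.mean()
        resid = y - intercept
        col_ss = (X ** 2).sum(axis=0)
        for _ in range(n_steps):
            b = X.T @ resid / col_ss                      # slope per column
            sse = ((resid[:, None] - X * b) ** 2).sum(axis=0)
            j = np.argmin(sse)                            # best single covariate
            coef[j] += nu * b[j]
            resid -= nu * b[j] * X[:, j]
        return intercept, coef

    # sparse truth: only the first two of 20 covariates matter
    rng = np.random.default_rng(0)
    n, p = 500, 20
    X = rng.standard_normal((n, p))
    y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.5 * rng.standard_normal(n)
    intercept, coef = l2_boost(X, y)
    ```

    Like the Lasso, the small steps plus early stopping act as regularization, so the procedure remains usable when p is large relative to n.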
  6. By: Victor Chernozhukov; Juan Carlos Escanciano; Hidehiko Ichimura; Whitney K. Newey
    Abstract: This paper shows how to construct locally robust semiparametric GMM estimators, meaning, equivalently, that the moment conditions have zero derivative with respect to the first step and that the first step does not affect the asymptotic variance. They are constructed by adding to the moment functions the adjustment term for first-step estimation. Locally robust estimators have several advantages. They are vital for valid inference with machine learning in the first step, see Belloni et al. (2012, 2014), and are less sensitive to the specification of the first step. They are doubly robust for affine moment functions, so moment conditions continue to hold when one first-step component is incorrect. Locally robust moment conditions also have smaller bias that is flatter as a function of first-step smoothing, leading to improved small sample properties. Series first-step estimators confer local robustness on any moment conditions and are doubly robust for affine moments, in the direction of the series approximation. Many new locally and doubly robust estimators are given here, including for economic structural models. We give simple asymptotic theory for estimators that use cross-fitting in the first step, including machine learning.
    Date: 2016–07
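    The "adjustment term" and double-robustness ideas can be illustrated with the classic augmented inverse-propensity-weighted (AIPW) moment for an average treatment effect — an illustrative special case on made-up data, not the paper's general construction. Here the propensity model is deliberately misspecified, yet the estimate stays on target because the outcome regressions are correct:

    ```python
    import numpy as np

    def ols_predict(X, y, Xnew):
        """First-step outcome regression: OLS fit and prediction."""
        Xc = np.column_stack([np.ones(len(X)), X])
        beta = np.linalg.lstsq(Xc, y, rcond=None)[0]
        return np.column_stack([np.ones(len(Xnew)), Xnew]) @ beta

    def aipw_ate(X, d, y, e_hat):
        """Doubly robust (AIPW) moment for the average treatment effect:
        outcome-regression difference plus the inverse-propensity
        adjustment term for first-step estimation."""
        mu1 = ols_predict(X[d == 1], y[d == 1], X)
        mu0 = ols_predict(X[d == 0], y[d == 0], X)
        psi = (mu1 - mu0
               + d * (y - mu1) / e_hat
               - (1 - d) * (y - mu0) / (1 - e_hat))
        return psi.mean()

    # synthetic data with true ATE = 1.5 and confounded treatment
    rng = np.random.default_rng(1)
    n = 5000
    X = rng.standard_normal((n, 3))
    e_true = 1 / (1 + np.exp(-X[:, 0]))          # true propensity varies with X
    d = (rng.uniform(size=n) < e_true).astype(float)
    y = 1.5 * d + X @ np.array([1.0, -0.5, 0.25]) + rng.standard_normal(n)
    # deliberately wrong propensity (constant 0.5): the moment still holds
    # because the other first-step component is correct -- double robustness
    ate_hat = aipw_ate(X, d, y, e_hat=np.full(n, 0.5))
    ```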
  7. By: Oancea, Bogdan; Pospisil, Richard; Dragoescu, Raluca
    Abstract: Education is one of the main determinants of the unemployment level in all EU countries. In this paper we use a logit model to estimate the effect of educational level on unemployment in Romania, using data from the 2011 Population and Housing Census. Besides educational level, we also use other socio-demographic variables recorded in the Census, such as gender, marital status, and residential area. Data processing was carried out in the R software system; since the data set used for model estimation was very large, we used techniques suited for big data processing. The results show that the lowest odds of being unemployed were recorded for the population with tertiary education, which is consistent with other studies at the international level and with official statistics, but our study indicates that tertiary education has a greater impact on unemployment in Romania than in other EU countries.
    Keywords: educational level, unemployment, logit, higher education
    JEL: I20 J24
    Date: 2016–06–07
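    The logit-with-odds-ratios exercise the abstract describes can be sketched in a few lines. This is a synthetic illustration — the covariates mirror the abstract (an education dummy, gender) but the coefficients are made up, not the Romanian census estimates, and the fitting is plain Newton–Raphson rather than the authors' R pipeline:

    ```python
    import numpy as np

    def logit_fit(X, y, n_iter=25):
        """Logistic regression by Newton-Raphson (iteratively reweighted LS)."""
        Xc = np.column_stack([np.ones(len(y)), X])
        beta = np.zeros(Xc.shape[1])
        for _ in range(n_iter):
            p = 1 / (1 + np.exp(-Xc @ beta))
            W = p * (1 - p)                               # observation weights
            beta += np.linalg.solve(Xc.T @ (Xc * W[:, None]), Xc.T @ (y - p))
        return beta

    # synthetic data: unemployment as a function of a tertiary-education
    # dummy and gender (coefficient values are illustrative only)
    rng = np.random.default_rng(7)
    n = 20000
    tertiary = rng.integers(0, 2, n)
    female = rng.integers(0, 2, n)
    logits = -1.0 - 1.5 * tertiary + 0.3 * female
    y = (rng.uniform(size=n) < 1 / (1 + np.exp(-logits))).astype(float)
    beta = logit_fit(np.column_stack([tertiary, female]), y)
    odds_ratio_tertiary = np.exp(beta[1])   # < 1: tertiary education lowers the odds
    ```

    Exponentiating a logit coefficient gives the odds ratio, which is how the paper reports the tertiary-education effect.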
  8. By: Roy Verbaan; Wilko Bolt; Carin van der Cruijsen
    Abstract: In this paper we analyse whether the use of debit card payments data improves the accuracy of one-quarter-ahead forecasts and nowcasts (current-quarter forecasts) of Dutch private household consumption. Since debit card payments data are available in a timely fashion, they may be a valuable indicator of economic activity. We study a variety of models with payments data and find that a combination of models provides the most accurate nowcast. The best combined model reduces the root mean squared prediction error (RMSPE) by 18% relative to the macroeconomic policy model (DELFI) used by the Dutch central bank (DNB). Based on these results for the Netherlands, we conclude that debit card payments data are useful in modelling household consumption.
    Keywords: Nowcasting; debit card payments; household consumption; Midas
    JEL: C53 E27
    Date: 2017–09
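    The evaluation metric and forecast-combination idea behind the headline result can be shown on toy numbers (a made-up quarterly series and two hypothetical nowcast models, not the paper's data or the DELFI model): when two models' errors partly offset, their average can beat either model alone by the RMSPE criterion.

    ```python
    import numpy as np

    def rmspe(actual, pred):
        """Root mean squared prediction error."""
        return float(np.sqrt(np.mean((np.asarray(actual) - np.asarray(pred)) ** 2)))

    # toy quarterly consumption-growth series and two nowcast models whose
    # errors partly offset each other
    actual = np.array([0.4, 0.6, 0.2, 0.5, 0.3, 0.7])
    model_a = actual + np.array([0.20, -0.10, 0.15, -0.20, 0.10, -0.15])
    model_b = actual + np.array([-0.15, 0.12, -0.10, 0.18, -0.12, 0.10])
    combined = (model_a + model_b) / 2        # equal-weight combination

    r_a = rmspe(actual, model_a)
    r_b = rmspe(actual, model_b)
    r_c = rmspe(actual, combined)
    ```

    A relative RMSPE reduction, like the 18% the paper reports against DELFI, is simply 1 − r_combined / r_benchmark.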
  9. By: Pedro S. Martins
    Abstract: Online jobs portals can be an important source of labour market information, including in developing countries. This paper presents an illustration from Mozambique, a country that, like other countries in Sub-Saharan Africa, has exhibited high economic growth rates but limited employment creation. First, we highlight the potential but also the pitfalls of these portals in characterising and improving the functioning of the labour market. We then analyse the micro (mouse-click-level) data made available by a portal focused on the formal sector of the Mozambique labour market. Our evidence is consistent with high levels of unemployment and/or underemployment. The findings are also suggestive of mismatches between labour demand and the supply of schooling and training.
    Keywords: Big data, Labour market information systems, Internet, Matching
    JEL: J23 J24 J64
    Date: 2017–09
  10. By: Neeraj Baruah; Amanda Dahlstrand-Rudin; Guy Michaels; Dzhamilya Nigmatulina; Ferdinand Rauch; Tanner Regan
    Abstract: What are the long run consequences of planning and providing basic infrastructure in neighborhoods where people build their own homes? We study "Sites and Services" projects implemented in seven Tanzanian cities during the 1970s and 1980s, half of which provided infrastructure in previously unpopulated areas (de novo neighborhoods), while the other half upgraded squatter settlements. Using satellite images and surveys from the 2010s, we find that de novo neighborhoods developed better housing than adjacent residential areas (control areas) that were also initially unpopulated. Specifically, de novo neighborhoods are more orderly and their buildings have larger footprint areas and are more likely to have multiple stories, as well as connections to electricity and water, basic sanitation, and access to roads. And though de novo neighborhoods generally attracted better educated residents than control areas, the educational difference is too small to account for the large difference in residential quality that we find. While we have no natural counterfactual for the upgrading areas, descriptive evidence suggests that they are, if anything, worse than the control areas.
    Keywords: urban economics, economic development, slums, Africa
    JEL: R31 O18 R14
    Date: 2017–09

This nep-big issue is ©2017 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at . For comments, please write to the director of NEP, Marco Novarese, at <>. Put “NEP” in the subject line; otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.