nep-big New Economics Papers
on Big Data
Issue of 2017‒11‒05
nine papers chosen by
Tom Coupé
University of Canterbury

  1. The State of Applied Econometrics - Causality and Policy Evaluation By Susan Athey; Guido Imbens
  2. Planning Ahead for Better Neighborhoods: Long Run Evidence from Tanzania By Guy Michaels; Dzhamilya Nigmatulina; Ferdinand Rauch; Tanner Regan; Neeraj Baruah; Amanda Dahlstrand-Rudin
  3. Social Capital and Labor Market Networks By Brian J. Asquith; Judith K. Hellerstein; Mark J. Kutzbach; David Neumark
  4. Data Network Effects: Implications for Data Business By Mitomo, Hitoshi
  5. Using Spatial Factor Analysis to Measure Human Development By Qihua Qiu; Jaesang Sung; Will Davis; Rusty Tchernis
  6. Generalized Random Forests By Susan Athey; Julie Tibshirani; Stefan Wager
  7. Regulatory Learning: how to supervise machine learning models? An application to credit scoring By Dominique Guegan; Bertrand Hassani
  8. Calibration of Machine Learning Classifiers for Probability of Default Modelling By Pedro G. Fonseca; Hugo D. Lopes
  9. Retail credit scoring using fine-grained payment data By TOBBACK, Ellen; MARTENS, David

  1. By: Susan Athey; Guido Imbens
    Abstract: In this paper we discuss recent developments in econometrics that we view as important for empirical researchers working on policy evaluation questions. We focus on three main areas, where in each case we highlight recommendations for applied work. First, we discuss new research on identification strategies in program evaluation, with particular focus on synthetic control methods, regression discontinuity, external validity, and the causal interpretation of regression methods. Second, we discuss various forms of supplementary analyses to make the identification strategies more credible. These include placebo analyses as well as sensitivity and robustness analyses. Third, we discuss recent advances in machine learning methods for causal effects. These advances include methods to adjust for differences between treated and control units in high-dimensional settings, and methods for identifying and estimating heterogeneous treatment effects.
    Date: 2016–07
  2. By: Guy Michaels; Dzhamilya Nigmatulina; Ferdinand Rauch; Tanner Regan; Neeraj Baruah; Amanda Dahlstrand-Rudin
    Abstract: What are the long run consequences of planning and providing basic infrastructure in neighborhoods where people build their own homes? We study “Sites and Services” projects implemented in seven Tanzanian cities during the 1970s and 1980s, half of which provided infrastructure in previously unpopulated areas (de novo neighborhoods), while the other half upgraded squatter settlements. Using satellite images and surveys from the 2010s, we find that de novo neighborhoods developed better housing than adjacent residential areas (control areas) that were also initially unpopulated. Specifically, de novo neighborhoods are more orderly and their buildings have larger footprint areas and are more likely to have multiple stories, as well as connections to electricity and water, basic sanitation and access to roads. And though de novo neighborhoods generally attracted better educated residents than control areas, the educational difference is too small to account for the large difference in residential quality that we find. While we have no natural counterfactual for the upgrading areas, descriptive evidence suggests that they are, if anything, worse than the control areas.
    Keywords: urban economics, economic development, slums, Africa
    JEL: R31 O18 R14
    Date: 2017
  3. By: Brian J. Asquith; Judith K. Hellerstein; Mark J. Kutzbach; David Neumark
    Abstract: We explore the links between social capital and labor market networks at the neighborhood level. We harness rich data taken from multiple sources, including matched employer-employee data with which we measure the strength of labor market networks, data on behavior such as voting patterns that have previously been tied to social capital, and new data – not previously used in the study of social capital – on the number and location of non-profits at the neighborhood level. We use a machine learning algorithm to identify potential social capital measures that best predict neighborhood-level variation in labor market networks. We find evidence suggesting that smaller and less centralized schools, and schools with fewer poor students, foster social capital that builds labor market networks, as does a larger Republican vote share. The presence of establishments in a number of non-profit-oriented industries is identified as predictive of strong labor market networks, likely because they either provide public goods or facilitate social contacts. These industries include, for example, churches and other religious institutions, schools, country clubs, and amateur or recreational sports teams or clubs.
    JEL: J01 J64 R23
    Date: 2017–10
  4. By: Mitomo, Hitoshi
    Abstract: This paper aims to investigate the existence of “data network effects” in data platform services such as Big Data, Internet-of-Things (IoT) and Artificial Intelligence (AI) and their influence on the diffusion of the services. It intends to present a preliminary formal analysis of the effects of data network externalities. Policy implications will be discussed in terms of the diffusion of services.
    Date: 2017
  5. By: Qihua Qiu; Jaesang Sung; Will Davis; Rusty Tchernis
    Abstract: We propose a Bayesian factor analysis model as an alternative to the Human Development Index (HDI). Our model provides a methodology that can either augment existing indices or build additional ones. In addition to addressing potential issues of the HDI, we estimate human development with three auxiliary variables capturing environmental health and sustainability, income inequality, and satellite-observed nightlight. We also use our method to build a Millennium Development Goals (MDG) index as an example of constructing a more complex index. We find the “living standard” dimension provides a greater contribution to human development than the official HDI suggests, while the “longevity” dimension provides a lower proportional contribution. Our results also show considerable levels of disagreement relative to the ranks of the official HDI. We report the sensitivity of our method to different specifications of spatial correlation, cardinal-to-ordinal data transforms, and data imputation procedures, along with the results of a simulated data exercise.
    JEL: O15 O57
    Date: 2017–10
  6. By: Susan Athey; Julie Tibshirani; Stefan Wager
    Abstract: We propose generalized random forests, a method for non-parametric statistical estimation based on random forests (Breiman, 2001) that can be used to fit any quantity of interest identified as the solution to a set of local moment equations. Following the literature on local maximum likelihood estimation, our method operates at a particular point in covariate space by considering a weighted set of nearby training examples; however, instead of using classical kernel weighting functions that are prone to a strong curse of dimensionality, we use an adaptive weighting function derived from a forest designed to express heterogeneity in the specified quantity of interest. We propose a flexible, computationally efficient algorithm for growing generalized random forests, develop a large sample theory for our method showing that our estimates are consistent and asymptotically Gaussian, and provide an estimator for their asymptotic variance that enables valid confidence intervals. We use our approach to develop new methods for three statistical tasks: non-parametric quantile regression, conditional average partial effect estimation, and heterogeneous treatment effect estimation via instrumental variables. A software implementation, grf for R and C++, is available from CRAN.
    Date: 2016–10
  7. By: Dominique Guegan (Centre d'Economie de la Sorbonne and LabEx ReFi); Bertrand Hassani (Group Capgemini and Centre d'Economie de la Sorbonne and LabEx ReFi)
    Abstract: The arrival of big data strategies is threatening the lastest trends in financial regulation related to the simplification of models and the enhancement of the comparability of approaches chosen by financial institutions. Indeed, the intrinsic dynamic philosophy of Big Data strategies is almost incompatible with the current legal and regulatory framework as illustrated in this paper. Besides, as presented in our application to credit scoring, the model selection may also evolve dynamically forcing both practitioners and regulators to develop libraries of models, strategies allowing to switch from one to the other as well as supervising approaches allowing financial institutions to innovate in a risk mitigated environment. The purpose of this paper is therefore to analyse the issues related to the Big Data environment and in particular to machine learning models highlighting the issues present in the current framework confronting the data flows, the model selection process and the necessity to generate appropriate outcomes.
    Keywords: Big Data; Credit scoring; machine learning; AUC; regulation
    Date: 2017–07
  8. By: Pedro G. Fonseca; Hugo D. Lopes
    Abstract: Binary classification is widely used in credit scoring for the estimation of probability of default. The validation of such predictive models is based both on ranking ability and on calibration (i.e. how accurately the probabilities output by the model map to the observed probabilities). In this study we cover the current best practices regarding calibration for binary classification, and explore how different approaches yield different results on real world credit scoring data. The limitations of evaluating credit scoring models using only ranking-ability metrics are explored. A benchmark is run on 18 real world datasets, and the results compared. The calibration techniques used are Platt Scaling and Isotonic Regression. Also, different machine learning models are used: Logistic Regression, Random Forest Classifiers, and Gradient Boosting Classifiers. Results show that when the dataset is treated as a time series, the use of re-calibration with Isotonic Regression is able to improve the long term calibration better than the alternative methods. Using re-calibration, the non-parametric models are able to outperform the Logistic Regression on Brier Score Loss.
    Date: 2017–10
  9. By: TOBBACK, Ellen; MARTENS, David
    Abstract: In this big data era, banks (like any other large company) are looking for novel ways to leverage their existing data assets. A major data source that has not yet been used to its full extent is the massive fine-grained payment data on their customers. In this paper, a design is proposed that builds predictive credit scoring models using the fine-grained payment data. Using a real-life data set of 183 million transactions made by 2.6 million customers, we show that our proposed design adds complementary predictive power to the current credit scoring models. Such improvement has a big impact on the overall working of the bank, from applicant scoring to minimum capital requirements.
    Date: 2017–10

This nep-big issue is ©2017 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at . For comments, please write to the director of NEP, Marco Novarese at <>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.