nep-big New Economics Papers
on Big Data
Issue of 2017‒08‒13
six papers chosen by
Tom Coupé
University of Canterbury

  1. Big data in Stata with the ftools package By Sergio Correia
  2. Analyzing satellite data in Stata By Hiren Nisar
  3. Now You See Me: High School Dropout and Machine Learning By Dario Sansone; Pooya Almasi
  4. Exploring the Potential of Machine Learning for Automatic Slum Identification from VHR Imagery By Duque, Juan Carlos; Patino, Jorge Eduardo; Betancourt, Alejandro
  5. Propensity Scores and Causal Inference Using Machine Learning Methods By Austin Nichols; Linden McBride
  6. Machine learning in sentiment reconstruction of the simulated stock market By Mikhail Goykhman; Ali Teimouri

  1. By: Sergio Correia (Board of Governors of the Federal Reserve System)
    Abstract: In recent years, very large datasets have become increasingly prevalent in most social sciences. However, some of the most important Stata commands (collapse, egen, merge, sort, etc.) rely on algorithms that are not well suited for big data. In my talk, I will present the ftools package, which contains plug-in alternatives to these commands and performs up to 20 times faster on large datasets. Further, I will explain the underlying algorithm and Mata function, and show how to use this function to create new Stata commands and to speed up existing packages.
    Date: 2017–08–10
  2. By: Hiren Nisar (Abt Associates)
    Abstract: We provide examples of how one can use satellite or other remote sensing data in Stata, with a variety of analysis methods, including examples of measuring economic disadvantage using satellite imagery.
    Date: 2017–08–10
  3. By: Dario Sansone (Georgetown University); Pooya Almasi (Georgetown University)
    Abstract: In this paper, we create an algorithm to predict which students will eventually drop out of US high schools, using information available in 9th grade. We show that using a naive model, as implemented in many schools, leads to poor predictions. We then explain how schools can obtain more precise predictions by exploiting the big data available to them, together with more sophisticated quantitative techniques. We compare the performance of econometric techniques such as Logistic Regression with Machine Learning tools such as Support Vector Machine, Boosting and LASSO. We offer practical advice on how to apply the new Machine Learning codes available in Stata to the high-dimensional datasets available in education. Model parameters are calibrated by taking into account policy goals and budget constraints.
    Date: 2017–08–10
  4. By: Duque, Juan Carlos; Patino, Jorge Eduardo; Betancourt, Alejandro
    Abstract: Slum identification in urban settlements is a crucial step in the formulation of pro-poor policies. However, conventional methods for slum detection, such as field surveys, can be time-consuming and costly. This paper explores the possibility of implementing a low-cost standardized method for slum detection. We use spectral, texture and structural features extracted from very high spatial resolution imagery as input data and evaluate the capability of three machine learning algorithms (Logistic Regression, Support Vector Machine and Random Forest) to classify urban areas as slum or non-slum. Using data from Buenos Aires (Argentina), Medellin (Colombia), and Recife (Brazil), we found that Support Vector Machine with a radial basis kernel delivers the best performance (over 0.81). We also found that singularities within cities preclude the use of a unified classification model.
    Keywords: Cities, Urban development, Economics, Equity and social inclusion, Georeferencing, Socioeconomic research, Poverty, Public policy, Public services, Housing
    Date: 2016
  5. By: Austin Nichols (Abt Associates); Linden McBride (Cornell University)
    Abstract: We compare a variety of methods for predicting the probability of a binary treatment (the propensity score), with the goal of comparing otherwise like cases in treatment and control conditions for causal inference about treatment effects. Under some circumstances, better prediction methods can improve causal inference by reducing both the finite-sample bias and the variability of estimators. In other cases, better predictions of the probability of treatment can increase bias and variance. We clarify the conditions under which different methods produce better or worse inference (in terms of mean squared error of causal impact estimates).
    Date: 2017–08–10
  6. By: Mikhail Goykhman; Ali Teimouri
    Abstract: In this paper we continue the study of the simulated stock market framework defined by the driving sentiment processes. We focus on the market environment driven by a buy/sell trading sentiment process of the Markov chain type. We apply the methodology of Hidden Markov Models and Recurrent Neural Networks to reconstruct the transition probability matrix of the Markov sentiment process and recover the underlying sentiment states from the observed stock price behavior.
    Date: 2017–08

This nep-big issue is ©2017 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at For comments please write to the director of NEP, Marco Novarese at <>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.