nep-big New Economics Papers
on Big Data
Issue of 2018‒01‒29
five papers chosen by
Tom Coupé
University of Canterbury

  1. Using Payroll Processor Microdata to Measure Aggregate Labor Market Activity By Tomaz Cajner; Leland Crane; Ryan Decker; Adrian Hamins-Puertolas; Christopher J. Kurz; Tyler Radler
  2. Spatial Patterns of Development: A Meso Approach By Michalopoulos, Stelios; Papaioannou, Elias
  3. The Global Health Networks: A Comparative Analysis of Tuberculosis, Malaria and Pneumonia Using Social Media Data By Milena Lopreite; Michelangelo Puliga; Massimo Riccaboni
  4. Better Together? Social Networks in Truancy and the Targeting of Treatment By Bennett, Magdalena; Bergman, Peter
  5. Ensemble Learning or Deep Learning? Application to Default Risk Analysis By Shigeyuki Hamori; Minami Kawai; Takahiro Kume; Yuji Murakami; Chikara Watanabe

  1. By: Tomaz Cajner; Leland Crane; Ryan Decker; Adrian Hamins-Puertolas; Christopher J. Kurz; Tyler Radler
    Abstract: We show that high-frequency private payroll microdata can help forecast labor market conditions. Payroll employment is perhaps the most reliable real-time indicator of the business cycle and is therefore closely followed by policymakers, academia, and financial markets. Government statistical agencies have long served as the primary suppliers of information on the labor market and will continue to do so for the foreseeable future. That said, sources of “big data” are becoming increasingly available through collaborations with private businesses engaged in commercial activities that record economic activity on a granular, frequent, and timely basis. One such data source is generated by the firm ADP, which processes payrolls for about one fifth of the U.S. private sector workforce. We evaluate the efficacy of these data to create new statistics that complement existing measures. In particular, we develop a set of weekly aggregate employment indexes from 2000 to 2017, which allows us to measure employment at a higher frequency than is currently possible. The extensive coverage of the ADP data (similar in terms of private employment to the BLS CES sample) implies potentially high information value of these data, and our results confirm this conjecture. Indeed, the timeliness and frequency of the ADP payroll microdata substantially improve forecast accuracy for both current-month employment and revisions to the BLS CES data.
    Keywords: Consumption, saving, production, employment, and investment; Labor supply and demand; Forecasting
    JEL: J2 J11 C53 C81
    Date: 2018–01–17
  2. By: Michalopoulos, Stelios; Papaioannou, Elias
    Abstract: Over the last two decades, the literature on comparative development has moved from country-level to within-country analyses. The questions asked have expanded, as economists have used satellite images of light density at night and other big spatial data to proxy for development at the desired level. The focus has also shifted from uncovering correlations to identifying causal relations, using elaborate econometric techniques including spatial regression discontinuity designs. In this survey we show how the combination of geographic information systems with insights from disciplines ranging from the earth sciences to linguistics and history has transformed the research landscape on the roots of the spatial patterns of development. We discuss the limitations of the luminosity data and associated econometric techniques and conclude by offering some thoughts on future research.
    JEL: N00 N9 O10 O43 O55
    Date: 2018–01
  3. By: Milena Lopreite (Scuola Superiore Sant’Anna); Michelangelo Puliga (IMT School for advanced studies); Massimo Riccaboni (IMT School for advanced studies)
    Abstract: Global health networks (GHNs) of organizations fighting major health threats represent a useful strategy to respond to the challenge of mobilizing and coordinating different types of health organizations across borders toward a common goal. In this paper we reconstruct the GHNs of malaria, tuberculosis and pneumonia by creating a new unique database of health organizations from the official Twitter accounts of each organization. We use a majority voter Multi Naive Bayes classifier to discover, among the Twitter users, the ones that represent organizations or groups active in each disease area. We perform a social network analysis (SNA) of the GHNs to evaluate the structure of each network and the role and performance of the organizations in it. We find evidence that the GHNs of malaria, tuberculosis and pneumonia differ in performance and leadership, geographical coverage, and Twitter popularity. Our analysis validates the use of social media to analyze GHNs and their effectiveness, and to mobilize the global community toward global sustainable development.
    Keywords: global health network; social network analysis; machine learning classifier; tuberculosis; malaria; pneumonia; policy evaluation
    JEL: I15 I18 C8
    Date: 2018–01
  4. By: Bennett, Magdalena (Columbia University); Bergman, Peter (Columbia University)
    Abstract: Truancy correlates with many risky behaviors and adverse outcomes. We use detailed administrative data on by-class absences to construct social networks based on students who miss class together. We simulate these networks and use permutation tests to show that certain students systematically coordinate their absences. Leveraging a parent-information intervention on student absences, we find spillover effects from treated students onto peers in their network. We show that an optimal-targeting algorithm that incorporates machine-learning techniques to identify heterogeneous effects, as well as the direct effects and spillover effects, could further improve the efficacy and cost-effectiveness of the intervention subject to a budget constraint.
    Keywords: social networks; peer effects; education
    JEL: I21 D85
    Date: 2018–01
  5. By: Shigeyuki Hamori (Graduate School of Economics, Kobe University); Minami Kawai (Department of Economics, Kobe University); Takahiro Kume (Department of Economics, Kobe University); Yuji Murakami (Department of Economics, Kobe University); Chikara Watanabe (Department of Economics, Kobe University)
    Abstract: Proper credit risk management is essential for lending institutions as substantial losses can be incurred when borrowers default. Consequently, statistical methods that can measure and analyze credit risk objectively are becoming increasingly important. This study analyzed default payment data from Taiwan and compared the prediction accuracy and classification ability of three ensemble learning methods (Bagging, Random Forest, and Boosting) with those of various neural network methods, each of which has a different activation function. The results indicate that Boosting has a high prediction accuracy, whereas that of Bagging and Random Forest is relatively low. They also indicate that the prediction accuracy and classification performance of Boosting are better than those of deep neural networks, Bagging, and Random Forest.
    Keywords: credit risk; ensemble learning; deep learning; bagging; random forest; boosting; deep neural network
    Date: 2018–01

This nep-big issue is ©2018 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at For comments please write to the director of NEP, Marco Novarese at <>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.