nep-big New Economics Papers
on Big Data
Issue of 2020‒02‒24
fourteen papers chosen by
Tom Coupé
University of Canterbury

  1. Robots and the origin of their labour-saving impact By Montobbio, Fabio; Staccioli, Jacopo; Virgillito, Maria Enrica; Vivarelli, Marco
  2. Where Should We Go? Internet Searches and Tourist Arrivals By Serhan Cevik
  3. Adaptive Safety Nets for Rural Africa: Drought-Sensitive Targeting with Sparse Data By Baez,Javier E.; Kshirsagar,Varun; Skoufias,Emmanuel
  4. Estimation of Poverty in Somalia Using Innovative Methodologies By Utz Pape; Philip Wollburg
  5. Objective Social Choice: Using Auxiliary Information to Improve Voting Outcomes By Silviu Pitis; Michael R. Zhang
  6. Which Model for Poverty Predictions? By Verme, Paolo
  7. Analyzing the importance of forward orientation in financial development-growth nexus: Evidence from big data By Taniya Ghosh; Prashant Mehul Parab; Sohini Sahu
  8. The effects of the Maputo ring road on the quantity and quality of nearby housing By Fisker Peter; Sohnesen Thomas; Malmgren-Hansen David
  9. Labor Market Analysis Using Big Data: The Case of a Pakistani Online Job Portal By Matsuda,Norihiko; Ahmed,Tutan; Nomura,Shinsaku
  10. Can Medium-Resolution Satellite Imagery Measure Economic Activity at Small Geographies? Evidence from Landsat in Vietnam By Goldblatt,Ran Philip; Heilmann,Kilian Tobias; Vaizman,Yonatan
  11. Hyperparameter Optimization for Forecasting Stock Returns By Sang Il Lee
  12. Discretization and Machine Learning Approximation of BSDEs with a Constraint on the Gains-Process By Idris Kharroubi; Thomas Lim; Xavier Warin
  13. A random forest based approach for predicting spreads in the primary catastrophe bond market By Despoina Makariou; Pauline Barrieu; Yining Chen
  14. Algorithmic Risk Assessment in the Hands of Humans By Stevenson, Megan T.; Doleac, Jennifer

  1. By: Montobbio, Fabio; Staccioli, Jacopo; Virgillito, Maria Enrica; Vivarelli, Marco
    Abstract: This paper investigates the presence of explicit labour-saving heuristics within robotic patents. It analyses innovative actors engaged in robotic technology and their economic environment (identity, location, industry), and identifies the technological fields particularly exposed to labour-saving innovations. It exploits advanced natural language processing and probabilistic topic modelling techniques on the universe of patent applications at the USPTO between 2009 and 2018, matched with ORBIS (Bureau van Dijk) firm-level dataset. The results show that labour-saving patent holders comprise not only robots producers, but also adopters. Consequently, labour-saving robotic patents appear along the entire supply chain. The paper shows that labour-saving innovations challenge manual activities (e.g. in the logistics sector), activities entailing social intelligence (e.g. in the healthcare sector) and cognitive skills (e.g. learning and predicting).
    Keywords: Robotic Patents,Labour-Saving Technology,Search Heuristics,Probabilistic Topic Models
    JEL: O33 J24 C38
    Date: 2020
  2. By: Serhan Cevik
    Abstract: The widespread availability of internet search data is a new source of high-frequency information that can potentially improve the precision of macroeconomic forecasting, especially in areas with data constraints. This paper investigates whether travel-related online search queries enhance accuracy in the forecasting of tourist arrivals to The Bahamas from the U.S. The results indicate that the forecast model incorporating internet search data provides additional information about tourist flows over a univariate approach using the traditional autoregressive integrated moving average (ARIMA) model and multivariate models with macroeconomic indicators. The Google Trends-augmented model improves predictability of tourist arrivals by about 30 percent compared to the benchmark ARIMA model and more than 20 percent compared to the model extended only with income and relative prices.
    Date: 2020–01–31
  3. By: Baez,Javier E.; Kshirsagar,Varun; Skoufias,Emmanuel
    Abstract: This paper combines remote-sensed data and individual child-, mother-, and household-level data from the Demographic and Health Surveys for five countries in Sub-Saharan Africa (Malawi, Tanzania, Mozambique, Zambia, and Zimbabwe) to design a prototype drought-contingent targeting framework that may be used in scarce-data contexts. To accomplish this, the paper: (i) develops simple and easy-to-communicate measures of drought shocks; (ii) shows that droughts have a large impact on child stunting in these five countries -- comparable, in size, to the effects of mother's illiteracy and a fall to a lower wealth quintile; and (iii) shows that, in this context, decision trees and logistic regressions predict stunting as accurately (out-of-sample) as machine learning methods that are not interpretable. Taken together, the analysis lends support to the idea that a data-driven approach may contribute to the design of policies that mitigate the impact of climate change on the world's most vulnerable populations.
    Date: 2019–12–02
  4. By: Utz Pape (World Bank); Philip Wollburg
    Abstract: Somalia is highly data-deprived, leaving policy makers to operate in a statistical vacuum. To overcome this challenge, the World Bank implemented wave 2 of the Somali High Frequency Survey to better understand livelihoods and vulnerabilities and, especially, to estimate national poverty indicators. The specific context of insecurity and lack of statistical infrastructure in Somalia posed several challenges for implementing a household survey and measuring poverty. This paper outlines how these challenges were overcome in wave 2 of the Somali High Frequency Survey through methodological and technological adaptations in four areas. First, in the absence of a recent census, no exhaustive lists of census enumeration areas along with population estimates existed, creating challenges to derive a probability-based representative sample. Therefore, geo-spatial techniques and high-resolution imagery were used to model the spatial population distribution, build a probability-based population sampling frame, and generate enumeration areas to overcome the lack of a recent population census. Second, although some areas remained completely inaccessible due to insecurity, even most accessible areas held potential risks to the safety of field staff and survey respondents, so that time spent in these areas had to be minimized. To address security concerns, the survey adapted logistical arrangements, sampling strategy using micro-listing, and questionnaire design to limit time on the ground based on the Rapid Consumption Methodology. Third, poverty in completely inaccessible areas had to be estimated by other means. Therefore, the Somali High Frequency Survey relies on correlates derived from satellite imagery and other geo-spatial data to estimate poverty in such areas. Finally, the nonstationary nature of the nomadic population required special sampling strategies.
    Keywords: Consumption Measurement, Poverty, Questionnaire Design
    JEL: C83 D63 I32
    Date: 2019–05
  5. By: Silviu Pitis; Michael R. Zhang
    Abstract: How should one combine noisy information from diverse sources to make an inference about an objective ground truth? This frequently recurring, normative question lies at the core of statistics, machine learning, policy-making, and everyday life. It has been called "combining forecasts", "meta-analysis", "ensembling", and the "MLE approach to voting", among other names. Past studies typically assume that noisy votes are identically and independently distributed (i.i.d.), but this assumption is often unrealistic. Instead, we assume that votes are independent but not necessarily identically distributed and that our ensembling algorithm has access to certain auxiliary information related to the underlying model governing the noise in each vote. In our present work, we: (1) define our problem and argue that it reflects common and socially relevant real world scenarios, (2) propose a multi-arm bandit noise model and count-based auxiliary information set, (3) derive maximum likelihood aggregation rules for ranked and cardinal votes under our noise model, (4) propose, alternatively, to learn an aggregation rule using an order-invariant neural network, and (5) empirically compare our rules to common voting rules and naive experience-weighted modifications. We find that our rules successfully use auxiliary information to outperform the naive baselines.
    Date: 2020–01
  6. By: Verme, Paolo
    Abstract: OLS models are the predominant choice for poverty predictions in a variety of contexts such as proxy-means tests, poverty mapping or cross-survey imputations. This paper compares the performance of econometric and machine learning models in predicting poverty using alternative objective functions and stochastic dominance analysis based on coverage curves. It finds that the choice of an optimal model largely depends on the distribution of incomes and the poverty line. Comparing the performance of different econometric and machine learning models is therefore an important step in the process of optimizing poverty predictions and targeting ratios.
    Keywords: Welfare Modelling,Income Distributions,Poverty Predictions,Imputations
    JEL: D31 D63 E64 O15
    Date: 2020
  7. By: Taniya Ghosh (Indira Gandhi Institute of Development Research); Prashant Mehul Parab (Indira Gandhi Institute of Development Research); Sohini Sahu (Indian Institute of Technology Kanpur)
    Abstract: The paper analyzes how the citizens' attitude towards future, obtained using big data, affects the relationship between the nation's financial development and economic growth. All financial development indicators, except for one, show significant negative growth effects. We find that individual's attitude towards future as captured by future orientation index (FOI) plays a significant role in affecting this relation. In particular, FOI interacts with financial development, and weakens the negative effect of financial development on nation's economic growth.
    Keywords: Developing countries, Developed countries, Economic growth, Financial development, Future Orientation Index
    JEL: G2 O16 O47
    Date: 2019–11
  8. By: Fisker Peter; Sohnesen Thomas; Malmgren-Hansen David
    Abstract: Using convolutional neural networks applied to satellite images covering a 25 km x 12 km rectangle on the northern outskirts of Greater Maputo, we detect and classify buildings from 2010 and 2018 in order to compare the development in quantity and quality of buildings from before and after construction of a major section of ring road. In addition, we analyse how the effects vary by distance to the road and conclude that the area has seen large overall growth in both quantity and quality of housing, but it is not possible to distinguish growth close to the road from general urban growth. Finally, the paper contributes methodologically to a growing strand of literature focused on combining machine-learning image recognition and the availability of high-resolution satellite images. We examine the extent to which it is possible to exploit these methods to analyse changes over time and thus provide an alternative (or complement) to traditional impact analyses.
    Keywords: Impact evaluation,infrastructure,remote sensing,Mozambique
    Date: 2019
  9. By: Matsuda,Norihiko; Ahmed,Tutan; Nomura,Shinsaku
    Abstract: Facing a youth bulge (a large influx of a young labor force), the Pakistani economy needs to create more jobs by taking advantage of this relatively well-educated young labor force. Yet, the educated young labor force suffers a higher unemployment rate, and there is a concern that the current education and training system in the country does not respond to skill demands in the private sector. This paper provides new descriptives about labor markets, particularly skill demand and supply, by using online job portal data. The paper finds that although there is an excess supply of highly educated workers, certain industries, such as information and communications technology, lack workers who have specialized skills and experience. The analysis also finds that the exact match of qualifications and skills is important for employers. Job applicants who are underqualified or overqualified for job posts are less likely to be shortlisted than those whose qualifications exactly match job requirements.
    Keywords: Labor Markets,Educational Sciences,Gender and Development,Rural Labor Markets
    Date: 2019–11–20
  10. By: Goldblatt,Ran Philip; Heilmann,Kilian Tobias; Vaizman,Yonatan
    Abstract: This study explores the potential and the limits of medium-resolution satellite data as a proxy for economic activity at small geographic units. Using a commune-level dataset from Vietnam, it compares the performance of commonly used nightlight data and higher resolution Landsat imagery which measures daytime light reflection. The analysis suggests that Landsat outperforms nighttime lights at predicting enterprise counts, employment, and expenditure in simple regression models. A parsimonious combination of the first two moments of the Landsat spectral bands can explain a reasonable share of the variation in economic activity in the cross-section. There is however poor prediction power of either satellite measure for changes over time.
    Date: 2019–12–17
  11. By: Sang Il Lee
    Abstract: In recent years, hyperparameter optimization (HPO) has become an increasingly important issue in the field of machine learning for the development of more accurate forecasting models. In this study, we explore the potential of HPO in modeling stock returns using a deep neural network (DNN). The potential of this approach was evaluated using technical indicators and fundamentals, examined based on the effect of dropout regularization and batch normalization for all input data. We found that the model using technical indicators and dropout regularization significantly outperforms three other models, showing a positive predictability of 0.53% in-sample and 1.11% out-of-sample, thereby indicating the possibility of beating the historical average. We also demonstrate the stability of the model in terms of the changes in its feature importance over time.
    Date: 2020–01
  12. By: Idris Kharroubi (LPSM UMR 8001); Thomas Lim (LaMME, ENSIIE); Xavier Warin (EDF)
    Abstract: We study the approximation of backward stochastic differential equations (BSDEs for short) with a constraint on the gains process. We first discretize the constraint by applying a so-called facelift operator at times of a grid. We show that this discretely constrained BSDE converges to the continuously constrained one as the mesh grid converges to zero. We then focus on the approximation of the discretely constrained BSDE. For that we adopt a machine learning approach. We show that the facelift can be approximated by an optimization problem over a class of neural networks under constraints on the neural network and its derivative. We then derive an algorithm converging to the discretely constrained BSDE as the number of neurons goes to infinity. We end by numerical experiments. Mathematics Subject Classification (2010): 65C30, 65M75, 60H35, 93E20, 49L25.
    Date: 2020–02
  13. By: Despoina Makariou; Pauline Barrieu; Yining Chen
    Abstract: We introduce a random forest approach to enable spreads' prediction in the primary catastrophe bond market. We investigate whether all information provided to investors in the offering circular prior to a new issuance is equally important in predicting its spread. The whole population of non-life catastrophe bonds issued from December 2009 to May 2018 is used. The random forest shows an impressive predictive power on unseen primary catastrophe bond data explaining 93% of the total variability. For comparison, linear regression, our benchmark model, has inferior predictive performance explaining only 47% of the total variability. All details provided in the offering circular are predictive of spread but in a varying degree. The stability of the results is studied. The usage of random forest can speed up investment decisions in the catastrophe bond industry.
    Date: 2020–01
  14. By: Stevenson, Megan T. (George Mason University); Doleac, Jennifer (Texas A&M University)
    Abstract: We evaluate the impacts of adopting algorithmic predictions of future offending (risk assessments) as an aid to judicial discretion in felony sentencing. We find that judges' decisions are influenced by the risk score, leading to longer sentences for defendants with higher scores and shorter sentences for those with lower scores. However, we find no robust evidence that this reshuffling led to a decline in recidivism, and, over time, judges appeared to use the risk scores less. Risk assessment's failure to reduce recidivism is at least partially explained by judicial discretion in its use. Judges systematically grant leniency to young defendants, despite their high risk of reoffending. This is in line with a long standing practice of treating youth as a mitigator in sentencing, due to lower perceived culpability. Such a conflict in goals may have led prior studies to overestimate the extent to which judges make prediction errors. Since one of the most important inputs to the risk score is effectively off-limits, risk assessment's expected benefits are curtailed. We find no evidence that risk assessment affected racial disparities statewide, although there was a relative increase in sentences for black defendants in courts that appeared to use risk assessment most. We conduct simulations to evaluate how race and age disparities would have changed if judges had fully complied with the sentencing recommendations associated with the algorithm. Racial disparities might have increased slightly, but the largest change would have been higher relative incarceration rates for defendants under the age of 23. In the context of contentious public discussions about algorithms, our results highlight the importance of thinking about how man and machine interact.
    Keywords: crime, risk assessment, prediction, algorithms, courts
    JEL: K4
    Date: 2019–12

This nep-big issue is ©2020 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at For comments please write to the director of NEP, Marco Novarese at <>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.