nep-big New Economics Papers
on Big Data
Issue of 2018‒08‒20
nineteen papers chosen by
Tom Coupé
University of Canterbury

  1. A Brief History of Human Time. Exploring a database of "notable people" By Olivier Gergaud; Morgane Laouenan; Etienne Wasmer
  2. Artificial Intelligence, Economics, and Industrial Organization By Hal Varian
  3. Generic Machine Learning Inference on Heterogenous Treatment Effects in Randomized Experiments By Victor Chernozhukov; Mert Demirer; Esther Duflo; Iván Fernández-Val
  4. Take a Look Around: Using Street View and Satellite Images to Estimate House Prices By Stephen Law; Brooks Paige; Chris Russell
  5. Agent-Based Model Calibration using Machine Learning Surrogates By Francesco Lamperti; Andrea Roventini; Amir Sani
  6. Machine Learning for Dynamic Models of Imperfect Information and Semiparametric Moment Inequalities By Vira Semenova
  7. Stock Price Correlation Coefficient Prediction with ARIMA-LSTM Hybrid Model By Hyeong Kyu Choi
  8. Joint data platforms as X factor for efficiency gains in the public sector? By Piret Tõnurist; Veiko Lember; Rainer Kattel
  9. Peer Effects in Water Conservation: Evidence from Consumer Migration By Bryan Bollinger; Jesse Burkhardt; Kenneth Gillingham
  10. A spatially accurate method for evaluating distributional effects of ecosystem services By Aliza Fleischer; Daniel Felsenstein; Michal Lichter
  11. Data for Good: Unlocking Privately-Held Data to the Benefit of the Many By Alemanno, Alberto
  12. Cross Validation Based Model Selection via Generalized Method of Moments By Junpei Komiyama; Hajime Shimao
  13. A Machine Learning Approach for Detecting Students at Risk of Low Academic Achievement By Sarah Cornell-Farrow; Robert Garrard
  14. AI and the Economy By Jason Furman; Robert Seamans
  15. Learning to Average Predictively over Good and Bad: Comment on: Using Stacking to Average Bayesian Predictive Distributions By Lennart (L.F.) Hoogerheide; Herman (H.K.) van Dijk
  16. Using Social Network Activity Data to Identify and Target Job Seekers By Ebbes, Peter; Netzer, Oded
  17. Labor Market Effects of Credit Constraints: Evidence from a Natural Experiment By Kumar, Anil; Liang, Che-Yuan
  18. Determinants of the Survival Ratio for De Jure Standards: AI-related technologies and interaction with patents By TAMURA Suguru
  19. Ouvrir la boîte noire des marchés du logement By Alexandre Coulondre

  1. By: Olivier Gergaud (KEDGE Business School [Talence] - M.E.N.E.S.R. - Ministère de l'Éducation nationale, de l’Enseignement supérieur et de la Recherche); Morgane Laouenan (CES - Centre d'économie de la Sorbonne - UP1 - Université Panthéon-Sorbonne - CNRS - Centre National de la Recherche Scientifique); Etienne Wasmer (ECON - Département d'économie - Sciences Po)
    Abstract: This paper describes a database of 1,243,776 notable people and 7,184,575 locations (Geolinks) associated with them throughout human history (3000BCE-2015AD). We first describe in detail the various approaches and procedures adopted to extract the relevant information from their Wikipedia biographies and then analyze the database. Ten main facts emerge. 1. The database has grown exponentially over time, with more than 60% of notable people still living in 2015, with the exception of a relative decline of the cohort born in the XVIIth century and a local minimum between 1645 and 1655. 2. The average lifespan has increased by 20 years, from 60 to 80 years, between the cohort born in 1400AD and the one born in 1900AD. 3. The share of women in the database follows a U-shaped pattern, with a minimum in the XVIIth century and a maximum at 25% for the most recent cohorts. 4. The fraction of notable people in governance occupations has decreased, while the fraction in occupations such as arts, literature/media and sports has increased over the centuries; sports caught up with arts and literature for cohorts born in 1870, remained at the same level until the 1950s cohorts, and eventually came to dominate the database after 1950. 5. The top 10 visible people born before 1890 are all non-American and have 10 different nationalities, whereas six of the top 10 born after 1890 are U.S.-born citizens. Since 1800, the share of people from Europe and the U.S. in the database has declined, while the number of people from Asia and the Southern Hemisphere has grown to reach 20% of the database in 2000. Coincidentally, in 1637, the exact barycenter of the base was in the small village of Colombey-les-Deux-Eglises (Champagne region in France), where Charles de Gaulle lived and passed away. Since the 1970s, the barycenter has oscillated between Morocco, Algeria and Tunisia. 6. The average distance between places of birth and death follows a U-shaped pattern: the median distance was 316km before 500AD, 100km between 500 and 1500AD, and has risen continuously since then. The greatest mobility occurs between the ages of 15 and 25. 7. Individuals with the highest levels of visibility tend to be more distant from their birth place, with a median distance of 785km for the top percentile, as compared to 389km for the top decile and 176km overall. 8. In all occupations, there has been a rise in international mobility since 1960: the fraction of locations in a country different from the place of birth went from 15% in 1955 to 35% after 2000. 9. There is no positive association between the size of cities and the visibility of people measured at the end of their life; if anything, the correlation is negative. 10. Last but not least, we find a positive correlation between the contemporaneous number of entrepreneurs and the urban growth of the city in which they are located in the following decades; more strikingly, the same is true of the contemporaneous number or share of artists, which positively affects city growth in the following decades. By contrast, we find a zero or negative correlation between the contemporaneous share of "militaries, politicians and religious people" and urban growth in the following decades.
    Keywords: Big Data, notable people
    Date: 2017–01–19
  2. By: Hal Varian
    Abstract: Machine learning (ML) and artificial intelligence (AI) have been around for many years. However, in the last 5 years, remarkable progress has been made using multilayered neural networks in diverse areas such as image recognition, speech recognition, and machine translation. AI is a general purpose technology that is likely to impact many industries. In this chapter I consider how machine learning availability might affect the industrial organization of both firms that provide AI services and industries that adopt AI technology. My intent is not to provide an extensive overview of this rapidly-evolving area, but instead to provide a short summary of some of the forces at work and to describe some possible areas for future research.
    JEL: L0
    Date: 2018–07
  3. By: Victor Chernozhukov; Mert Demirer; Esther Duflo; Iván Fernández-Val
    Abstract: We propose strategies to estimate and make inference on key features of heterogeneous effects in randomized experiments. These key features include best linear predictors of the effects using machine learning proxies, average effects sorted by impact groups, and average characteristics of the most and least impacted units. The approach is valid in high-dimensional settings, where the effects are proxied by machine learning methods. We post-process these proxies into estimates of the key features. Our approach is generic; it can be used in conjunction with penalized methods, deep and shallow neural networks, canonical and new random forests, boosted trees, and ensemble methods. It does not rely on strong assumptions. In particular, we do not require conditions for consistency of the machine learning methods. Estimation and inference rely on repeated data splitting to avoid overfitting and achieve validity. For inference, we take medians of p-values and medians of confidence intervals resulting from many different data splits, and then adjust their nominal level to guarantee uniform validity. This variational inference method is shown to be uniformly valid and quantifies the uncertainty coming from both parameter estimation and data splitting. An empirical application to the impact of micro-credit on economic development illustrates the use of the approach in randomized experiments.
    JEL: C18 C21 D14 G21 O16
    Date: 2018–06
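    The split-and-aggregate inference step described in this abstract (repeat the data split many times, take the median of the per-split p-values, and adjust the nominal level) can be sketched as follows. This is an illustrative toy, not the authors' procedure: the difference-in-means test, the normal approximation, and the simple doubling adjustment are assumptions made here for brevity.

    ```python
    import math
    import random
    import statistics

    def two_sided_p(t):
        # two-sided p-value under a normal approximation to the t-statistic
        return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(t) / math.sqrt(2.0))))

    def split_pvalue(y, d):
        # naive difference-in-means test on one random half of the sample
        idx = list(range(len(y)))
        random.shuffle(idx)
        half = idx[: len(idx) // 2]
        treated = [y[i] for i in half if d[i] == 1]
        control = [y[i] for i in half if d[i] == 0]
        diff = statistics.mean(treated) - statistics.mean(control)
        se = math.sqrt(statistics.variance(treated) / len(treated)
                       + statistics.variance(control) / len(control))
        return two_sided_p(diff / se)

    random.seed(0)
    n = 400
    d = [random.randint(0, 1) for _ in range(n)]          # random assignment
    y = [1.0 * di + random.gauss(0.0, 1.0) for di in d]   # true effect = 1

    pvals = [split_pvalue(y, d) for _ in range(51)]        # many data splits
    median_p = statistics.median(pvals)
    adjusted_p = min(1.0, 2.0 * median_p)  # level adjustment for split noise
    ```

    The point of the aggregation is that no single (lucky or unlucky) split drives the conclusion; the adjustment pays a factor-of-two price for that robustness.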
  4. By: Stephen Law; Brooks Paige; Chris Russell
    Abstract: When individuals purchase a home, they simultaneously purchase its structural features, its accessibility to work, and the neighborhood amenities. Some amenities, such as air quality, are measurable, whilst others, such as the prestige or the visual impression of a neighborhood, are difficult to quantify. Despite the well-known impact intangible housing features have on house prices, limited attention has been given to systematically quantifying these difficult-to-measure amenities. Two issues have led to this neglect: few quantitative methods exist that can measure the urban environment, and the collection of such data is both costly and subjective. We show that street image and satellite image data can capture these urban qualities and improve the estimation of house prices. We propose a pipeline that uses a deep neural network model to automatically extract visual features from images to estimate house prices in London, UK. We make use of traditional housing features such as age, size and accessibility, as well as visual features from Google Street View images and Bing aerial images, in estimating the house price model. We find encouraging results: learning to characterize the urban quality of a neighborhood improves house price prediction, even when generalizing to previously unseen London boroughs. We explore the use of non-linear vs. linear methods to fuse these cues with conventional models of house pricing, and show how the interpretability of linear models allows us to directly extract the visual desirability of neighborhoods as proxy variables that are both of interest in their own right and could be used as inputs to other econometric methods. This is particularly valuable because, once the network has been trained, it can be applied elsewhere, allowing us to generate vivid dense maps of the desirability of London streets.
    Date: 2018–07
  5. By: Francesco Lamperti (Laboratory of Economics and Management (LEM) - Scuola Superiore Sant'Anna [Pisa]); Andrea Roventini (OFCE - Observatoire Français des Conjonctures économiques - Institut d'Études Politiques [IEP] - Paris - Fondation Nationale des Sciences Politiques [FNSP], Laboratory of Economics and Management (LEM) - Scuola Superiore Sant'Anna [Pisa]); Amir Sani (CES - Centre d'économie de la Sorbonne - UP1 - Université Panthéon-Sorbonne - CNRS - Centre National de la Recherche Scientifique)
    Abstract: Taking agent-based models (ABM) closer to the data is an open challenge. This paper explicitly tackles parameter space exploration and calibration of ABMs by combining supervised machine learning and intelligent sampling to build a surrogate meta-model. The proposed approach provides a fast and accurate approximation of model behaviour, dramatically reducing computation time. In doing so, our machine-learning surrogate facilitates large-scale explorations of the parameter space, while providing a powerful filter to gain insights into the complex functioning of agent-based models. The algorithm introduced in this paper merges model simulation and output analysis into a surrogate meta-model, which substantially eases ABM calibration. We successfully apply our approach to the Brock and Hommes (1998) asset pricing model and to the "Island" endogenous growth model (Fagiolo and Dosi, 2003). Performance is evaluated against a relatively large out-of-sample set of parameter combinations, while employing different user-defined statistical tests for output analysis. The results demonstrate the capacity of machine-learning surrogates to facilitate fast and precise exploration of agent-based models' behaviour over their often rugged parameter spaces.
    Keywords: meta-model, agent-based model, surrogate, calibration, machine learning
    Date: 2017–04–03
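    The surrogate idea in this abstract - learn a cheap approximation of the expensive simulator, use it to screen the parameter space, and spend real simulations only on the most promising candidates - can be sketched as follows. The quadratic "ABM" and the nearest-neighbour surrogate are toy stand-ins chosen here for brevity; the paper works with actual agent-based models and more powerful learners.

    ```python
    import random

    def abm(theta):
        # stand-in for one expensive agent-based model run: returns a
        # distance between simulated and empirical moments (lower is better)
        return (theta[0] - 0.3) ** 2 + (theta[1] + 0.7) ** 2

    def nearest_neighbour_surrogate(train):
        # cheapest possible surrogate: predict the output of the closest
        # already-simulated parameter vector
        def predict(theta):
            closest = min(train, key=lambda rec: (rec[0][0] - theta[0]) ** 2
                                                 + (rec[0][1] - theta[1]) ** 2)
            return closest[1]
        return predict

    random.seed(1)
    budget = 60  # expensive simulations we can afford up front
    train = []
    for _ in range(budget):
        theta = (random.uniform(-1, 1), random.uniform(-1, 1))
        train.append((theta, abm(theta)))

    surrogate = nearest_neighbour_surrogate(train)

    # screen a large candidate pool with the cheap surrogate, then verify
    # only the most promising candidates with the real (expensive) model
    pool = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(5000)]
    pool.sort(key=surrogate)
    best = min(pool[:20], key=abm)
    ```

    The pattern is generic: any regressor can replace the nearest-neighbour predictor, and the 5,000 surrogate evaluations cost a tiny fraction of 5,000 real simulator runs.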
  6. By: Vira Semenova
    Abstract: This paper develops estimation and inference tools for the structural parameter in a dynamic game with a high-dimensional state space, under the assumption that the data are generated by a single Markov perfect equilibrium. The equilibrium assumption implies that the expected value function evaluated at the equilibrium strategy profile is not smaller than the expected value function evaluated at a feasible suboptimal alternative. The target identified set is defined as the set of parameters obeying this inequality restriction. I estimate the expected value function in a two-stage procedure. At the first stage, I estimate the law of motion of the state variable and the equilibrium policy function using modern machine learning methods. At the second stage, I construct the estimator of the expected value function as the sum of a naive plug-in estimator and a bias-correction term that removes the bias of the naive estimator. The proposed estimator of the identified set converges at the root-N rate to the true identified set and can be used to construct its confidence regions.
    Date: 2018–08
  7. By: Hyeong Kyu Choi
    Abstract: Predicting the price correlation of two assets for future time periods is important in portfolio optimization. We apply LSTM recurrent neural networks (RNNs) to predicting the stock price correlation coefficient of two individual stocks. RNNs are competent in understanding temporal dependencies, and the use of LSTM cells further enhances their long-term predictive properties. To encompass both linearity and nonlinearity in the model, we also adopt the ARIMA model. The ARIMA model filters out linear tendencies in the data and passes on the residual value to the LSTM model. The ARIMA-LSTM hybrid model is tested against other traditional predictive financial models such as the full historical model, the constant correlation model, the single index model and the multi-group model. In our empirical study, the predictive ability of the ARIMA-LSTM model turned out to be superior to all the other financial models by a significant margin. Our work implies that the ARIMA-LSTM model is worth considering for forecasting correlation coefficients in portfolio optimization.
    Date: 2018–08
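    The hybrid pipeline in this abstract - a linear time-series model captures the linear component, and its residuals are handed to a nonlinear learner - can be sketched as follows. To keep the sketch self-contained, an AR(1) fit stands in for the full ARIMA stage and a trivial residual-mean predictor stands in for the LSTM; both substitutions are assumptions of this illustration, not the paper's method.

    ```python
    import random
    import statistics

    def fit_ar1(series):
        # least-squares AR(1) coefficient: x_t ~ phi * x_{t-1}
        num = sum(series[t] * series[t - 1] for t in range(1, len(series)))
        den = sum(x * x for x in series[:-1])
        return num / den

    random.seed(2)
    # synthetic correlation-coefficient series with a linear AR component
    x = [0.5]
    for _ in range(300):
        x.append(0.8 * x[-1] + random.gauss(0.0, 0.05))

    # stage 1: the linear model filters out the linear tendency
    phi = fit_ar1(x)
    linear_pred = [phi * x[t - 1] for t in range(1, len(x))]
    residuals = [x[t] - linear_pred[t - 1] for t in range(1, len(x))]

    # stage 2: in the hybrid, `residuals` would be fed to the LSTM;
    # here a constant residual-mean predictor is the (toy) stand-in
    resid_pred = statistics.mean(residuals)
    hybrid_pred = [lp + resid_pred for lp in linear_pred]
    ```

    Swapping the stand-in for a real recurrent network changes only stage 2; the decomposition into a linear filter plus a residual learner is the structural idea.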
  8. By: Piret Tõnurist; Veiko Lember; Rainer Kattel
    Abstract: Data analytics and interoperability have become pivotal issues in the creation of new public services. Furthermore, new information technology (IT) solutions influence organisational boundaries and can become drivers of centralization or decentralization alike. In this article we argue that increasing capacity for data analytics and data use (e.g., through joint platforms) engenders a new form of coordination in the public sector: machine-to-machine coordination. We seek to answer whether such coordination practices based on interoperable data platforms also introduce efficiency gains to the public sector. In this working paper we connect three interrelated topics: first, how joint data platforms affect inter-organisational information sharing (i.e. collaborative service provision); second, if and how efficiency gains can be achieved by these collaborative initiatives; and lastly, how organisations and governance in the public sector change through the implementation of these initiatives. To exemplify this research puzzle, the case of the Estonian data exchange platform, X-road, is examined.
    Date: 2016–08
  9. By: Bryan Bollinger; Jesse Burkhardt; Kenneth Gillingham
    Abstract: Social interactions are widely understood to influence consumer decisions in many choice settings. This paper identifies causal peer effects in water conservation during the growing season, utilizing variation from consumer migration. We use machine learning to classify high-resolution remote sensing images to provide evidence that conversion to dry landscaping underpins the peer effects in water consumption. We also provide evidence that without a price signal, peer effects are muted, demonstrating a complementarity between information transmission and prices. These results inform water use policy in many areas of the world threatened by recurring drought conditions.
    JEL: L95 Q25 R23
    Date: 2018–07
  10. By: Aliza Fleischer; Daniel Felsenstein; Michal Lichter
    Abstract: The value of most ecosystem services invariably slips through national accounts. Even when these values are estimated, they are allocated without any particular spatial referencing. Little is known about the spatial and distributional effects arising from changes in ecosystem service provision. This paper estimates spatial equity in ecosystem services provision using a dedicated data disaggregation algorithm that allocates 'synthetic' socioeconomic attributes to households with accurate geo-referencing. A GIS-based automated procedure is operationalized for three different ecosystems in Israel: beaches, urban parks and national parks. A nonlinear function relates household location to each ecosystem. Benefit measures are derived by modeling household consumer surplus as a function of socio-economic attributes and distance from the ecosystem. These aggregate measures are spatially disaggregated to households. Results show that restraining access to beaches causes a greater reduction in welfare than restraining access to a park. Progressively, high-income households lose relatively more in welfare terms than low-income households from such action. This outcome is reversed when distributional outcomes are measured in terms of housing price classes. These findings have implications for housing policies that attempt to use new development to generate social heterogeneity in locations proximate to ecosystem services.
    Keywords: Agribusiness, Agricultural Finance
    Date: 2017–11–12
  11. By: Alemanno, Alberto
    Abstract: It is almost a truism to argue that data holds great promise as a transformative resource for social good, helping to address a complex range of societal issues, ranging from saving lives in the aftermath of a natural disaster to predicting teen suicides. Yet it is not public authorities who hold this real-time data, but private entities, such as mobile network operators and credit card companies, and - with even greater detail - tech firms such as Google, through its globally dominant search engine, and, in particular, social media platforms such as Facebook and Twitter. Apart from a few isolated and self-proclaimed ‘data philanthropy’ initiatives and other corporate data-sharing collaborations, data-rich companies have historically shown resistance not only to sharing this data for the public good, but also to identifying its inherent social, non-commercial benefit. How does one explain to citizens across the world that their own data - which has been aggressively harvested over time - cannot be used, not even in emergency situations? Responding to this unsettling question entails a fascinating research journey for anyone interested in how the promises of big data could deliver for society as a whole. In the absence of a plausible solution, the number of societal problems that won't be solved unless firms like Facebook, Google and Apple start coughing up more data-based evidence will increase exponentially, as will societal rejection of their underlying business models. This article identifies the major challenges of unlocking privately-held data to the benefit of society and sketches a research agenda for scholars interested in collaborative and regulatory solutions aimed at unlocking privately-held data for good.
    Keywords: Big data; data; data governance; data sharing; data risk; data invisible; risk governance; philanthropy;
    JEL: I18 K23 K32 K40
    Date: 2018–06–11
  12. By: Junpei Komiyama; Hajime Shimao
    Abstract: Structural estimation is an important methodology in empirical economics, and a large class of structural models are estimated through the generalized method of moments (GMM). Traditionally, selection among structural models has been performed based on model fit upon estimation, using the entire observed sample. In this paper, we propose a model selection procedure based on cross-validation (CV), which uses a sample-splitting technique to avoid issues such as overfitting. While CV is widely used in machine learning communities, we are the first to prove its consistency for model selection in the GMM framework. Its empirical properties are compared to existing methods by simulations of IV regressions and an oligopoly market model. In addition, we propose a way to apply our method within the Mathematical Programming with Equilibrium Constraints (MPEC) approach. Finally, we apply our method to online-retail sales data to compare a dynamic market model with a static model.
    Date: 2018–07
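    The cross-validated model selection idea above can be sketched in a stripped-down method-of-moments setting: estimate each candidate model on the training folds, then score it by how badly its moment conditions are violated on the held-out folds. The linear-vs-constant candidate models and the specific moment conditions below are illustrative assumptions, far simpler than the paper's IV and oligopoly simulations.

    ```python
    import random

    def fit_moments(data, with_slope):
        # method-of-moments fit of y = a (+ b * x) by solving the
        # sample moment conditions in closed form
        xs, ys = zip(*data)
        if not with_slope:
            return (sum(ys) / len(ys), 0.0)
        mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
        b = (sum((x - mx) * (y - my) for x, y in data)
             / sum((x - mx) ** 2 for x in xs))
        return (my - b * mx, b)

    def cv_criterion(data, with_slope, folds=5):
        # average squared moment violation on the held-out folds
        score = 0.0
        for k in range(folds):
            test = data[k::folds]
            train = [d for i, d in enumerate(data) if i % folds != k]
            a, b = fit_moments(train, with_slope)
            # moments E[e] = 0 and E[x e] = 0 evaluated out of sample
            e = [y - a - b * x for x, y in test]
            g1 = sum(e) / len(e)
            g2 = sum(x * ei for (x, _), ei in zip(test, e)) / len(e)
            score += g1 * g1 + g2 * g2
        return score / folds

    random.seed(3)
    data = [(x, 1.0 + 2.0 * x + random.gauss(0.0, 0.5))
            for x in [random.uniform(-1, 1) for _ in range(200)]]
    slope_score = cv_criterion(data, with_slope=True)
    const_score = cv_criterion(data, with_slope=False)
    ```

    Because the data have a genuine slope, the misspecified constant model leaves a large out-of-sample violation of E[x e] = 0, so CV picks the correct model without ever comparing in-sample fit.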
  13. By: Sarah Cornell-Farrow; Robert Garrard
    Abstract: We aim to predict whether a primary school student will perform in the 'below standard' band of a standardized test based on a set of individual, school-level, and family-level observables. We exploit a data set containing test performance on the National Assessment Program - Literacy and Numeracy (NAPLAN), a test given annually to all Australian primary school students in grades 3, 5, 7, and 9. Students who perform in the 'below standard' band constitute approximately 3% of the sample, with the remainder performing at or above standard, requiring that a proposed classifier be robust to imbalanced classes. Observations for students in grades 5, 7, and 9 contain data on previous achievement in NAPLAN. We separate the analysis into students in grade 5 and above, for whom previous achievement may be used as a predictor, and students in grade 3, for whom we must rely on family and school-level predictors only. On each subset of the data, we train and compare a set of classifiers in order to predict below-standard performance in the reading and numeracy learning areas respectively. The best classifiers for grades 5 and above achieve an area under the ROC curve (AUC) of approximately 95%, and for grade 3 achieve an AUC of approximately 80%. Our results suggest that it is feasible for schools to screen a large number of students for their risk of obtaining below-standard achievement a full two years before they are identified as achieving below standard on their next NAPLAN test.
    Date: 2018–07
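    The AUC metric this abstract reports is well suited to its 3%-positive class imbalance because it is a ranking statistic, not an accuracy: it equals the probability that a randomly chosen positive case is scored above a randomly chosen negative one. A minimal implementation, with a hypothetical imbalanced toy example:

    ```python
    def roc_auc(labels, scores):
        # AUC as the Mann-Whitney rank statistic: the probability that a
        # randomly chosen positive outranks a randomly chosen negative
        # (ties count half)
        pos = [s for l, s in zip(labels, scores) if l == 1]
        neg = [s for l, s in zip(labels, scores) if l == 0]
        wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
                   for p in pos for q in neg)
        return wins / (len(pos) * len(neg))

    # hypothetical imbalanced example: 2 'below standard' cases out of 12
    labels = [1, 1] + [0] * 10
    scores = [0.9, 0.4] + [0.5] + [0.1] * 9
    auc = roc_auc(labels, scores)  # one positive is out-ranked once: 19/20
    ```

    Note that a classifier predicting "at or above standard" for everyone would score 97% accuracy on such data but only 0.5 AUC, which is why the paper reports the latter.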
  14. By: Jason Furman; Robert Seamans
    Abstract: We review the evidence that artificial intelligence (AI) is having a large effect on the economy. Across a variety of statistics—including robotics shipments, AI startups, and patent counts—there is evidence of a large increase in AI-related activity. We also review recent research in this area, which suggests that AI and robotics have the potential to increase productivity growth but may have mixed effects on labor, particularly in the short run. In particular, some occupations and industries may do well while others experience labor market upheaval. We then consider current and potential policies around AI that may help to boost productivity growth while also mitigating any labor market downsides, including the pros and cons of an AI-specific regulator, expanded antitrust enforcement, and alternative strategies for dealing with the labor-market impacts of AI, such as universal basic income and guaranteed employment.
    JEL: H23 J24 J65 L1 L4 L78 O3 O4
    Date: 2018–06
  15. By: Lennart (L.F.) Hoogerheide (VU University Amsterdam); Herman (H.K.) van Dijk (Erasmus University, Norges Bank)
    Abstract: We suggest extending the stacking procedure for combining predictive densities, proposed by Yao et al. in the journal Bayesian Analysis, to a setting where dynamic learning occurs about features of the predictive densities of possibly misspecified models. This improves the averaging process over good and bad model forecasts. We summarise how this learning is done in economics and finance using mixtures. We also show that our proposal can be extended to combining forecasts and policies. The technical tools necessary for the implementation refer to filtering methods from nonlinear time series analysis, and we show their connection with machine learning. We illustrate our suggestion using results from Basturk et al. based on financial data on US portfolios from 1928 until 2015.
    Keywords: Bayesian learning; predictive density combinations
    JEL: C11 C15
    Date: 2018–08–08
  16. By: Ebbes, Peter; Netzer, Oded
    Abstract: An important challenge for many firms is to identify the life transitions of their customers, such as job searching, being pregnant, or purchasing a home. Inferring such transitions, which are generally unobserved by the firm, can offer the firm opportunities to be more relevant to its customers. In this paper, we demonstrate how a social network platform can leverage its longitudinal user data to identify which of its users are likely job seekers. Identifying job seekers is at the heart of the business model of professional social network platforms. Our proposed approach builds on the hidden Markov model (HMM) framework to recover the latent state of job search from noisy signals obtained from social network activity data. Specifically, our modeling approach combines cross-sectional survey responses to a job-seeking status question with longitudinal user activity data. Thus, in some time periods, and for some users, we observe the “true” job-seeking status. We fuse the observed state information into the HMM likelihood, resulting in a partially observed HMM. We demonstrate that the proposed model can predict not only which users are likely to be job seeking at any point in time, but also which activities on the platform are associated with job search, and how long users have been job seeking. Furthermore, we find that targeting job seekers based on our proposed approach can lead to a 42% increase in the profits of a targeting campaign relative to the approach that was used at the time of the data collection.
    Keywords: Hidden Markov Models; Data Fusion; Targeting; Customer Analytics
    JEL: C10 J20 J40 M30
    Date: 2018–06–01
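    The "fuse the observed state into the HMM likelihood" step described above can be sketched with a forward recursion in which the latent state is clamped to its survey-revealed value in the periods where it is observed. The two-state setup, the transition/emission numbers, and the single binary activity signal below are illustrative assumptions, not the paper's specification.

    ```python
    def forward_partial(signals, emit, trans, init, known):
        # forward recursion for an HMM where the latent state is observed
        # ("clamped") in some periods, e.g. via a survey response
        n_states = len(init)
        alpha = [init[s] * emit[s][signals[0]] for s in range(n_states)]
        if 0 in known:
            alpha = [a if s == known[0] else 0.0 for s, a in enumerate(alpha)]
        for t in range(1, len(signals)):
            new = []
            for s in range(n_states):
                p = sum(alpha[q] * trans[q][s] for q in range(n_states))
                p *= emit[s][signals[t]]
                if t in known and s != known[t]:
                    p = 0.0  # survey reveals the state: zero out the others
                new.append(p)
            alpha = new
        return sum(alpha)  # likelihood of the observed signal sequence

    # two latent states: 0 = not job seeking, 1 = job seeking;
    # one binary activity signal per period (e.g. profile updated or not)
    init = [0.5, 0.5]
    trans = [[0.9, 0.1],   # not seeking -> (not seeking, seeking)
             [0.2, 0.8]]   # seeking     -> (not seeking, seeking)
    emit = [[0.8, 0.2],    # P(signal | not seeking)
            [0.3, 0.7]]    # P(signal | seeking)
    signals = [1, 1, 0]

    lik_unrestricted = forward_partial(signals, emit, trans, init, {})
    lik_with_survey = forward_partial(signals, emit, trans, init, {1: 1})
    ```

    Clamping removes all latent-state paths inconsistent with the survey answer, so the partially observed likelihood is a sub-sum of the unrestricted one; maximising it pins the model's parameters to the ground-truth periods while the standard HMM machinery handles the unobserved ones.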
  17. By: Kumar, Anil (Federal Reserve Bank of Dallas); Liang, Che-Yuan (Uppsala University)
    Abstract: We exploit the 1998 and 2003 constitutional amendments in Texas—allowing home equity loans and lines of credit for non-housing purposes—as natural experiments to estimate the effect of easier credit access on the labor market. Using state-level as well as county-level data and the synthetic control approach, we find that easier access to housing credit led to a notably lower labor force participation rate between 1998 and 2007. We show that our findings are remarkably robust to improved synthetic control methods based on insights from machine learning. We explore treatment effect heterogeneity using grouped data from the basic monthly CPS and find that declines in the labor force participation rate were larger among females, prime-age individuals, and the college-educated. Analysis of March CPS data confirms that the negative effect of easier home equity access on labor force participation was largely concentrated among homeowners, with little discernible impact on renters, as expected. We find that, while the labor force participation rate experienced persistent declines following the amendments that allowed access to home equity, the impact on GDP growth was relatively muted. Our research shows that the labor market effects of easier credit access should be an important factor when assessing its stimulative impact on overall growth.
    Keywords: Credit Constraints and Labor Supply; Synthetic Control with Machine Learning
    JEL: E24 E65 J21 R23
    Date: 2018–08–01
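    The synthetic control approach used above chooses non-negative weights, summing to one, over untreated "donor" units so that the weighted donor pool tracks the treated unit before treatment; the post-treatment gap is then the estimated effect. A minimal sketch of the weight-finding step, using a clip-and-renormalise gradient heuristic rather than an exact simplex projection (and none of the paper's machine-learning refinements):

    ```python
    def synthetic_control_weights(donors, treated, steps=5000, lr=0.005):
        # projected-gradient sketch: minimise the pre-treatment fit error
        # subject to weights being non-negative and summing to one
        k, T = len(donors), len(treated)
        w = [1.0 / k] * k
        for _ in range(steps):
            fit = [sum(w[j] * donors[j][t] for j in range(k)) for t in range(T)]
            grad = [(2.0 / T) * sum((fit[t] - treated[t]) * donors[j][t]
                                    for t in range(T)) for j in range(k)]
            w = [max(0.0, wi - lr * g) for wi, g in zip(w, grad)]  # clip at 0
            s = sum(w) or 1.0
            w = [wi / s for wi in w]                               # renormalise
        return w

    # toy pre-treatment series: the treated unit is an equal mix of the
    # first two donors, so the third should get (almost) no weight
    donors = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0], [5.0, 5.0, 5.0]]
    treated = [2.0, 2.0, 2.0]
    w = synthetic_control_weights(donors, treated)
    ```

    The simplex constraint is what makes the counterfactual an interpolation of real control units rather than an arbitrary regression fit, which is why it survives the machine-learning-based robustness checks the abstract mentions.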
  18. By: TAMURA Suguru
    Abstract: In this study, I focus on the characteristics of standardized technologies and the effective terms of regional de jure standards in order to discuss the effectiveness of their management system; namely, the focus is on the total cost of ownership of standards. To this end, I analyze the determining factors using approximately 14,000 cases of Japanese Industrial Standards. I conduct a semi-parametric survival analysis to study the determinants of the effective term. The results demonstrate that the technological category and the type of technical standard (e.g., product and design) significantly affect the effective term. International standards, legislative use, and standard-essential patents are also shown to have a significant influence on the effective term. Moreover, a technological category of artificial intelligence (AI)-related standards is established, and the effective term of AI-related technologies is found to be significantly shorter. These findings contribute to better design of global and regional standards management processes, which in turn will contribute to improvements in the efficiency of global and regional innovation systems.
    Date: 2018–08
  19. By: Alexandre Coulondre (LAB'URBA - LAB'URBA - UPEM - Université Paris-Est Marne-la-Vallée - UPEC UP12 - Université Paris-Est Créteil Val-de-Marne - Paris 12)
    Abstract: As the French Parliament examines the possibility of opening access to the local real-estate data held by the tax administration, this text reviews what is at stake in such an opening. It presents the currently available sources and shows how tax data would make it possible to better understand the local evolution of residential housing markets.
    Keywords: DVF, Housing markets, Real estate, Open data, Loi ESSOC, Big data
    Date: 2018–07–09

This nep-big issue is ©2018 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at For comments please write to the director of NEP, Marco Novarese at <>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.