nep-big New Economics Papers
on Big Data
Issue of 2019‒04‒08
seventeen papers chosen by
Tom Coupé
University of Canterbury

  1. Artificial Intelligence: A European Perspective By Alessandro Annoni; Peter Benczur; Paolo Bertoldi; Blagoj Delipetrev; Giuditta De Prato; Claudio Feijoo; Enrique Fernandez Macias; Emilia Gomez Gutierrez; Maria Iglesias Portela; Henrik Junklewitz; Montserrat Lopez Cobo; Bertin Martens; Susana Figueiredo do Nascimento; Stefano Nativi; Alexandre Polvora; Jose Ignacio Sanchez Martin; Songul Tolan; Ilkka Tuomi; Lucia Vesnic Alujevic
  2. The race against the robots and the fallacy of the giant cheesecake: Immediate and imagined impacts of artificial intelligence By Naude, Wim
  3. Property rights, market access and crop cultivation in Southern Rhodesia: evidence from historical satellite data By Tawanda Chingozha; Dieter von Fintel
  4. Using Deep Learning Neural Networks and Candlestick Chart Representation to Predict Stock Market By Rosdyana Mangir Irawan Kusuma; Trang-Thi Ho; Wei-Chun Kao; Yu-Yen Ou; Kai-Lung Hua
  5. What Is the Value Added by Using Causal Machine Learning Methods in a Welfare Experiment Evaluation? By Strittmatter, Anthony
  6. Deep Learning in Asset Pricing By Luyang Chen; Markus Pelger; Jason Zhu
  7. Housing and Discrimination in Economics: an Empirical Approach using Big Data and Natural Experiments By Jean-Benoît Eymeoud
  8. Food inflation nowcasting with web scraped data By Paweł Macias; Damian Stelmasiak
  9. Estimating Heterogeneous Reactions to Experimental Treatments By Christoph Engel
  10. How BLUE is the Sky? Estimating the Air Quality Data in Beijing During the Blue Sky Day Period (2008-2012) by the Bayesian LSTM Approach By Han, Y.; Li, V.; Lam, J.; Pollitt, M.
  11. Population, light, and the size distribution of cities By Christian Duben; Melanie Krause
  12. Observing Economic Growth in Unrecognized States with Nighttime Light By Masayuki Kudamatsu
  13. Detection of ship lights at sea By Matveev, Aleksey (Матвеев, Алексей); Andreev, Alexander (Андреев, Александр); Zhizhin, Mikhail (Жижин, Михаил); Poyda, Alexey (Пойда, Алексей); Troussov, Alexander (Трусов, Александр)
  14. Big Data in Finance and the Growth of Large Firms By Juliane Begenau; Maryam Farboodi; Laura Veldkamp
  15. Improving metadata infrastructure for complex surveys: Insights from the Fragile Families Challenge By Alexander Kindel; Vineet Bansal; Kristin Catena; Thomas Hartshorne; Kate Jaeger
  16. Academic offer and demand for advanced profiles in the EU. Artificial Intelligence, High Performance Computing and Cybersecurity By Montserrat Lopez-Cobo; Giuditta De Prato; Georgios Alaveras; Riccardo Righi; Sofia Samoili; Jiri Hradec; Lukasz Ziemba; Katarzyna Pogorzelska; Melisande Cardona
  17. Historical Analysis of National Subjective Wellbeing using Millions of Digitized Books By Hills, Thomas; Illushka Seresinhe, Chanuki; Proto, Eugenio; Sgroi, Daniel

  1. By: Alessandro Annoni (European Commission - JRC); Peter Benczur (European Commission - JRC); Paolo Bertoldi; Blagoj Delipetrev; Giuditta De Prato; Claudio Feijoo; Enrique Fernandez Macias; Emilia Gomez Gutierrez; Maria Iglesias Portela; Henrik Junklewitz; Montserrat Lopez Cobo; Bertin Martens; Susana Figueiredo do Nascimento; Stefano Nativi; Alexandre Polvora; Jose Ignacio Sanchez Martin; Songul Tolan; Ilkka Tuomi; Lucia Vesnic Alujevic
    Abstract: We are only at the beginning of a rapid period of transformation of our economy and society due to the convergence of many digital technologies. Artificial Intelligence (AI) is central to this change and offers major opportunities to improve our lives. The recent developments in AI are the result of increased processing power, improvements in algorithms and the exponential growth in the volume and variety of digital data. Many applications of AI have started entering our everyday lives, from machine translation to image recognition and music generation, and are increasingly deployed in industry, government, and commerce. Connected and autonomous vehicles and AI-supported medical diagnostics are areas of application that will soon be commonplace. There is strong global competition on AI among the US, China, and Europe. The US leads for now but China is catching up fast and aims to lead by 2030. For the EU, it is not so much a question of winning or losing a race but of finding a way to embrace the opportunities offered by AI that is human-centred, ethical, secure, and true to our core values. The EU Member States and the European Commission are developing coordinated national and European strategies, recognising that only together can we succeed. We can build on our areas of strength including excellent research, leadership in some industrial sectors like automotive and robotics, a solid legal and regulatory framework, and very rich cultural diversity also at regional and sub-regional levels. It is generally recognised that AI can flourish only if supported by a robust computing infrastructure and good quality data:
• With respect to computing, we identified a window of opportunity for Europe to invest in the emerging new paradigm of computing distributed towards the edges of the network, in addition to centralised facilities. This will also support the future deployment of 5G and the Internet of Things.
• With respect to data, we argue in favour of learning from successful Internet companies, opening access to data and developing interactivity with the users rather than just broadcasting data. In this way, we can develop ecosystems of public administrations, firms, and civil society enriching the data to make it fit for AI applications responding to European needs.
We should embrace the opportunities afforded by AI but not uncritically. The black-box characteristics of most leading AI techniques make them opaque even to specialists. AI systems are currently limited to narrow and well-defined tasks, and their technologies inherit imperfections from their human creators, such as the well-recognised bias effect present in data. We should challenge the shortcomings of AI and work towards strong evaluation strategies, transparent and reliable systems, and good human-AI interactions. Ethical and secure-by-design algorithms are crucial to build trust in this disruptive technology, but we also need a broader engagement of civil society on the values to be embedded in AI and the directions for future development. This social engagement should be part of the effort to strengthen our resilience at all levels from local, to national and European, across institutions, industry and civil society. Developing local ecosystems of skills, computing, data, and applications can foster the engagement of local communities, respond to their needs, harness local creativity and knowledge, and build a human-centred, diverse, and socially driven AI. We still know very little about how AI will impact the way we think, make decisions, relate to each other, and how it will affect our jobs. This uncertainty can be a source of concern but is also a sign of opportunity. The future is not yet written. We can shape it based on our collective vision of what future we would like to have. But we need to act together and act fast.
    Keywords: artificial intelligence, AI strategy, AI Techno-economic segment, Ethics, Legal, education, economic, cybersecurity, data strategies, computing architectures, energy, resilience
    Date: 2018–12
  2. By: Naude, Wim (UNU-MERIT, Maastricht University and MSM, and RWTH Aachen, and IZA Bonn)
    Abstract: After a number of AI-winters, AI is back with a boom. There are concerns that it will disrupt society. The immediate concern is whether labor can win a `race against the robots' and the longer-term concern is whether an artificial general intelligence (super-intelligence) can be controlled. This paper describes the nature and context of these concerns, reviews the current state of the empirical and theoretical literature in economics on the impact of AI on jobs and inequality, and discusses the challenge of AI arms races. It is concluded that despite the media hype neither massive job losses nor a `Singularity' are imminent. In part, this is because current AI, based on deep learning, is expensive and difficult for (especially small) businesses to adopt, can create new jobs, and is an unlikely route to the invention of a super-intelligence. Even though AI is unlikely to have either utopian or apocalyptic impacts, it will challenge economists in coming years. The challenges include regulation of data and algorithms; the (mis-) measurement of value added; market failures, anti-competitive behaviour and abuse of market power; surveillance, censorship, cybercrime; labor market discrimination, declining job quality; and AI in emerging economies.
    Keywords: Technology, artificial intelligence, productivity, labor demand, innovation, inequality
    JEL: O47 O33 J24 E21 E25
    Date: 2019–03–07
  3. By: Tawanda Chingozha (Department of Economics, Stellenbosch University); Dieter von Fintel (Department of Economics, Stellenbosch University and Institute of Labor Economics (IZA), Bonn)
    Abstract: Agriculture plays a central role in the efforts to fight poverty and achieve economic growth. This is especially relevant in sub-Saharan Africa (SSA) where the majority of the population lives in rural areas. A key issue that is generally believed to unlock agriculture potential is the recognition of property rights through land titling, yet there is no overwhelming empirical evidence to support this in the case of SSA (Udry, 2011). This paper investigates access to markets as an important pre-condition for land titles to result in agricultural growth. Using the case of Southern Rhodesia, we investigate whether land titles incentivised African large-scale holders in the Native Purchase Areas (NPAs) to put more of their available land under cultivation than their counterparts in the overcrowded Tribal Trust Areas (TTAs). We create a novel dataset by applying a Support Vector Machine (SVM) learning algorithm on Landsat imagery for the period 1972 to 1984 - the period during which the debate on the nexus between land rights and agricultural production intensified. Our results indicate that land titles are only beneficial when farmers are located closer to main cities, main roads and rail stations or sidings.
    Keywords: land titling, access to markets, machine learning, remote sensing
    JEL: C81 N37 Q13 Q15
    Date: 2019
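    The paper classifies Landsat pixels as cultivated or not with a Support Vector Machine. As a rough illustration of the idea, the sketch below trains a minimal linear SVM by hinge-loss sub-gradient descent on invented two-band "pixel" values; the actual study uses real Landsat imagery and a tuned SVM implementation, so everything here (band values, labels, hyperparameters) is hypothetical.

```python
# Minimal linear SVM trained by sub-gradient descent on the hinge loss.
# Synthetic two-band "pixels": cultivated land is given higher band-2 values.

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Hinge-loss SVM via sub-gradient descent. Labels y must be in {-1, +1}."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:          # inside the margin: hinge sub-gradient step
                w = [wj - lr * (lam * wj - yi * xj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:                   # correctly classified: only regularize
                w = [wj - lr * lam * wj for wj in w]
    return w, b

def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

X = [[0.2, 0.8], [0.3, 0.7], [0.25, 0.9], [0.7, 0.2], [0.8, 0.3], [0.9, 0.25]]
y = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(X, y)
print([predict(w, b, x) for x in X])
```

In practice one would use a library SVM with kernel and regularization chosen by cross-validation; the point here is only the hinge-loss decision rule.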
  4. By: Rosdyana Mangir Irawan Kusuma; Trang-Thi Ho; Wei-Chun Kao; Yu-Yen Ou; Kai-Lung Hua
    Abstract: Stock market prediction remains a challenging problem because many factors affect stock prices, such as company news and performance, industry performance, investor sentiment, social media sentiment, and economic conditions. This work explores predictability in the stock market using deep convolutional networks and candlestick charts. The outcome is used to design a decision support framework that traders can use to obtain suggested indications of future stock price direction. We evaluate several types of neural networks, including a convolutional neural network, a residual network, and a visual geometry group (VGG) network. Historical stock market data are converted into candlestick charts, which are then fed as input to train a convolutional neural network model. The model learns the patterns inside the candlestick charts and predicts future movements of the stock market. The effectiveness of our method is evaluated on stock market prediction with promising results: 92.2% and 92.1% accuracy on the Taiwan and Indonesian stock market datasets, respectively. The constructed model has been implemented as a freely available web-based system for predicting stock movements from candlestick charts with deep learning neural networks.
    Date: 2019–02
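    The key data-representation step in the paper is turning a window of price bars into an image for a CNN. The toy sketch below renders invented OHLC bars onto a small binary pixel grid, just to make the chart-as-image idea concrete; the paper renders full candlestick charts and trains deep networks on them.

```python
# Render a window of OHLC bars onto a binary pixel grid (one column per bar),
# the kind of image-like input a CNN would consume. OHLC values are invented.

def ohlc_to_grid(bars, height=8):
    """Mark, for each bar's column, the rows spanned by its high-low range."""
    lo = min(b["low"] for b in bars)
    hi = max(b["high"] for b in bars)
    span = hi - lo or 1.0
    def row(price):  # map a price to a grid row (0 = bottom)
        return min(height - 1, int((price - lo) / span * height))
    grid = [[0] * len(bars) for _ in range(height)]
    for col, b in enumerate(bars):
        for r in range(row(b["low"]), row(b["high"]) + 1):
            grid[r][col] = 1
    return grid

bars = [
    {"open": 10, "high": 12, "low": 9, "close": 11},
    {"open": 11, "high": 14, "low": 11, "close": 13},
    {"open": 13, "high": 15, "low": 12, "close": 12},
]
for row_ in reversed(ohlc_to_grid(bars)):   # print top row first
    print("".join("#" if c else "." for c in row_))
```

A real pipeline would also encode open/close bodies and color, and feed a much finer grid to the network.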
  5. By: Strittmatter, Anthony
    Abstract: Recent studies have proposed causal machine learning (CML) methods to estimate conditional average treatment effects (CATEs). In this study, I investigate whether CML methods add value compared to conventional CATE estimators by re-evaluating Connecticut's Jobs First welfare experiment. This experiment entails a mix of positive and negative work incentives. Previous studies show that it is hard to tackle the effect heterogeneity of Jobs First by means of CATEs. I report evidence that CML methods can provide support for the theoretical labor supply predictions. Furthermore, I document reasons why some conventional CATE estimators fail and discuss the limitations of CML methods.
    Keywords: Labor supply, individualized treatment effects, conditional average treatment effects, random forest
    JEL: H75 I38 J22 J31 C21
    Date: 2019
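    To make the CATE idea concrete: the simplest conditional estimator fits one outcome model per treatment arm and takes their difference within covariate cells. The sketch below does exactly that with cell means on invented data; the causal machine learning methods the paper studies (e.g. causal forests) generalize this to rich covariate spaces.

```python
# Minimal "T-learner" style CATE estimator: per-cell treated mean minus
# per-cell control mean. Data rows are (covariate_cell, treated_flag, outcome)
# and are entirely hypothetical.

from collections import defaultdict

def cate_by_cell(rows):
    """Return {cell: treated mean - control mean} for cells with both arms."""
    sums = defaultdict(lambda: [0.0, 0, 0.0, 0])  # t_sum, t_n, c_sum, c_n
    for cell, treated, y in rows:
        s = sums[cell]
        if treated:
            s[0] += y; s[1] += 1
        else:
            s[2] += y; s[3] += 1
    return {cell: s[0] / s[1] - s[2] / s[3]
            for cell, s in sums.items() if s[1] and s[3]}

# Hypothetical experiment: the incentive raises earnings only in cell "low".
data = [("low", 1, 120), ("low", 1, 140), ("low", 0, 100), ("low", 0, 100),
        ("high", 1, 200), ("high", 1, 200), ("high", 0, 210), ("high", 0, 190)]
print(cate_by_cell(data))  # {'low': 30.0, 'high': 0.0}
```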
  6. By: Luyang Chen; Markus Pelger; Jason Zhu
    Abstract: We estimate a general non-linear asset pricing model with deep neural networks applied to all U.S. equity data combined with a substantial set of macroeconomic and firm-specific information. Our crucial innovation is the use of the no-arbitrage condition as part of the neural network algorithm. We estimate the stochastic discount factor (SDF or pricing kernel) that explains all asset prices from the conditional moment constraints implied by no-arbitrage. For this purpose, we combine three different deep neural network structures in a novel way: a feedforward network to capture non-linearities, a recurrent Long Short-Term Memory network to find a small set of economic state processes, and a generative adversarial network to identify the portfolio strategies with the most unexplained pricing information. Our model allows us to identify the key factors that drive asset prices, detect mis-pricing of stocks, and generate the mean-variance efficient portfolio. Empirically, our approach outperforms all other benchmark approaches out-of-sample: our optimal portfolio has an annual Sharpe Ratio of 2.1, we explain 8% of the variation in individual stock returns, and we explain over 90% of average returns for all anomaly-sorted portfolios.
    Date: 2019–03
  7. By: Jean-Benoît Eymeoud (Département d'économie)
    Abstract: The first chapter documents a key parameter for understanding the housing market: the housing supply elasticity of French urban areas. We show that this elasticity can be captured in two ways, by considering the intensive and extensive margins of housing supply. Drawing on a large amount of newly collected data and an original estimation strategy, this first chapter estimates and decomposes the two elasticities. The second chapter is devoted to the possibilities offered by Big Data for studying the French rental housing market. Exploiting online data from December 2015 to June 2017 and comparing them with standard administrative data, we show that the internet provides data that can accurately track local housing markets. The third chapter deals with discrimination against women in politics. It exploits a natural experiment, the 2015 French departmental elections, in which, for the first time in the history of French elections, candidates had to run in mandatorily mixed-gender pairs. Using the fact that the order in which candidates appeared on the ballot was determined alphabetically, and showing that this rule does not appear to have been used strategically by the parties, we show, first, that the position of women on the ballot is as good as random and, second, that right-wing pairs whose female candidate's name appears first on the ballot receive on average 1.5 percentage points fewer votes.
    Keywords: Urban Economics, Housing Market, Discrimination, Big Data, Public Policy
    Date: 2018–10
  8. By: Paweł Macias (Narodowy Bank Polski); Damian Stelmasiak (Narodowy Bank Polski)
    Abstract: In this paper we evaluate the ability of web scraped data to improve nowcasts of Polish food inflation. The nowcasting performance of online price indices is compared with aggregated and disaggregated benchmark models in a pseudo real-time experiment. We also explore product selection and classification problems, their importance in constructing web price indices, and other limitations of online datasets. Therefore, we experiment not only with raw indices, but also with several approaches to include them into model-based forecasts. Our findings indicate that the optimal way to incorporate web scraped data into regular forecasting is to include them in simple distributed-lag models at the lowest aggregation level, combine the forecasts and aggregate them using statistical office methodology. We find this approach superior to other benchmark models which do not take online information into account.
    Keywords: web scraping, nowcasting, inflation, big data, online prices
    JEL: E37 C81
    Date: 2019
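    A common building block for online price indices is the Jevons index, an unweighted geometric mean of price relatives over products matched between two periods. The sketch below computes it on invented prices; it is a simplification of the paper's pipeline, which additionally handles product selection, classification and aggregation by statistical-office weights.

```python
# Chained Jevons index: geometric mean of p_t / p_{t-1} over matched products.
# Prices below are made up; entering/exiting products are simply dropped.

from math import prod

def jevons_index(prev_prices, curr_prices):
    """Geometric mean of price relatives over products present in both periods."""
    common = prev_prices.keys() & curr_prices.keys()
    if not common:
        raise ValueError("no matched products")
    relatives = [curr_prices[k] / prev_prices[k] for k in common]
    return prod(relatives) ** (1 / len(relatives))

jan = {"bread": 2.00, "milk": 1.00, "eggs": 3.00}
feb = {"bread": 2.20, "milk": 1.00, "eggs": 3.00, "butter": 5.00}  # new product ignored
print(round(jevons_index(jan, feb), 4))  # 1.0323
```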
  9. By: Christoph Engel (Max Planck Institute for Research on Collective Goods)
    Abstract: Frequently in experiments there is not only variance in the reaction of participants to treatment; the heterogeneity is patterned, with discernible types of participants reacting differently. In principle, a finite mixture model is well suited to simultaneously estimate the probability that a given participant belongs to a certain type and the reaction of this type to treatment. Yet finite mixture models often need more data than the experiment provides, require ex ante knowledge about the number of types, and are hard to estimate for panel data, which is what experiments often generate. For repeated experiments, this paper offers a simple two-step alternative that is much less data hungry, makes it possible to find the number of types in the data, and allows for the estimation of panel data models. It combines machine learning methods with classic frequentist statistics.
    Keywords: heterogeneous treatment effect, finite mixture model, panel data, two-step approach, machine learning, CART
    JEL: C14 C23 C91
    Date: 2019–01
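    The two-step logic can be sketched as: first cluster participants into types from their behavioral profiles, then estimate the treatment reaction separately per type. Below is a toy version with a tiny one-dimensional k-means on invented data; the paper's first step uses machine learning methods such as CART, and the choice of k = 2 here is purely illustrative.

```python
# Step 1: cluster participants' profiles (1-D k-means).
# Step 2: average the treatment reaction within each cluster ("type").

def kmeans_1d(values, k=2, iters=20):
    # spread the initial centers across the sorted values
    centers = sorted(values)[:: max(1, len(values) // k)][:k]
    for _ in range(iters):
        groups = [[] for _ in centers]
        for v in values:
            groups[min(range(k), key=lambda i: abs(v - centers[i]))].append(v)
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return [min(range(k), key=lambda i: abs(v - centers[i])) for v in values]

# Hypothetical participants: average contributions (profile) and reactions.
profiles = [1, 2, 1, 9, 10, 8]
reactions = [0.1, 0.2, 0.0, 2.0, 2.2, 1.8]
types = kmeans_1d(profiles)
for t in sorted(set(types)):
    rs = [r for r, ty in zip(reactions, types) if ty == t]
    print(f"type {t}: mean reaction {sum(rs) / len(rs):.2f}")
```

The second step in the paper estimates full panel data models per type rather than simple means.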
  10. By: Han, Y.; Li, V.; Lam, J.; Pollitt, M.
    Abstract: Over the last three decades, air pollution has become a major environmental challenge in many of the fast-growing cities in China, including Beijing. Given that long-term exposure to high levels of air pollution has devastating health consequences, accurately monitoring and reporting air pollution information to the public is critical for ensuring public health and safety and facilitating rigorous air pollution and health-related scientific research. Recent statistical research examining China’s air quality data has posed questions regarding data accuracy, especially data reported during the Blue Sky Day (BSD) period (2000 – 2012), though the accuracy of publicly available air quality data in China has improved gradually over recent years (2013 – 2017). To the best of our understanding, no attempt has been made to re-estimate the air quality data during the BSD period. In this paper, we put forward a machine-learning model to re-estimate the official air quality data during the BSD period of 2008 – 2012, based on the PM2.5 data of the Beijing US Embassy, and the proxy data covering Aerosol Optical Depth (AOD) and meteorology. Results have shown that the average re-estimated daily air quality values are respectively 64% and 61% higher than the official values, for air quality index (AQI) and AQI equivalent PM2.5, during the BSD period of 2008 to 2012. Moreover, the re-estimated BSD air quality data exhibit reduced statistical discontinuity and irregularity, based on our validation tests. The results suggest that the proposed data re-estimation methodology has the potential to provide more justifiable historical air quality data for evidence-based environmental decision-making in China.
    Keywords: Blue Sky Day (BSD), Air Quality, Beijing, Data Irregularity, Bayesian LSTM, Data Estimation
    JEL: C53 C63 Q53
    Date: 2019–03–21
  11. By: Christian Duben (Hamburg University); Melanie Krause (Hamburg University)
    Abstract: We provide new insights on the city size distribution of countries around the world. Using geo-spatial data and a globally consistent city identification scheme, our data set contains 13,844 cities in 194 countries. City size is measured both in terms of population and night time lights proxying for local economic activity. We find that Zipf's law holds for many, but not all, countries in terms of population, while city size in terms of light is distributed more unequally. These deviations from Zipf's law are to a large extent driven by an undue concentration in the largest cities. These cities benefit from agglomeration effects which seem to work through scale rather than through density. Examining the cross-country heterogeneity in the city size distribution, our model selection approach suggests that historical factors play an important role, in line with the time of development hypothesis.
    Keywords: Cities, Zipf's Law, Urban Concentration, Geo-spatial Data.
    JEL: R11 R12 O18 C18
    Date: 2019–01
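    The standard way to test Zipf's law is a rank-size regression: regress log(rank) on log(size) and check whether the slope is near -1. The sketch below does this with ordinary least squares on synthetic city sizes generated from an exact rank^-1 law; the paper fits such relationships to 13,844 real cities measured by population and night lights (refinements such as the Gabaix-Ibragimov rank - 1/2 correction are omitted here).

```python
# Rank-size (Zipf) regression: OLS slope of log(rank) on log(size).
# A slope close to -1 indicates Zipf's law. City sizes below are synthetic.

from math import log

def zipf_slope(sizes):
    ordered = sorted(sizes, reverse=True)
    xs = [log(s) for s in ordered]
    ys = [log(rank) for rank in range(1, len(ordered) + 1)]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Sizes that follow an exact rank^-1 law, so the estimated slope is -1.
cities = [1000 / r for r in range(1, 51)]
print(round(zipf_slope(cities), 2))  # -1.0
```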
  12. By: Masayuki Kudamatsu (Associate Professor, Osaka School of International Public Policy, Osaka University)
    Abstract: This paper uses the satellite images of nighttime light to estimate economic growth rates in four unrecognized states of the former Soviet Union: Nagorno-Karabakh in Azerbaijan, Abkhazia and South Ossetia in Georgia, and Transnistria in Moldova. We then compare these estimates against those similarly obtained for the parent states to gauge the impact of non-recognition as sovereign states on economic activities. The estimated economic growth rates do not differ much between the breakaway territories and their parent states, suggesting that the economic impact of non-recognition as states may be fairly limited.
    Keywords: Unrecognized states, the former Soviet Union, satellite data, civil conflicts, economic growth
    JEL: D74 O43 P48
    Date: 2019–03
  13. By: Matveev, Aleksey (Матвеев, Алексей) (The Russian Presidential Academy of National Economy and Public Administration); Andreev, Alexander (Андреев, Александр) (The Russian Presidential Academy of National Economy and Public Administration); Zhizhin, Mikhail (Жижин, Михаил) (The Russian Presidential Academy of National Economy and Public Administration); Poyda, Alexey (Пойда, Алексей) (The Russian Presidential Academy of National Economy and Public Administration); Troussov, Alexander (Трусов, Александр) (The Russian Presidential Academy of National Economy and Public Administration)
    Abstract: The avalanche increase in the amount of satellite data received by modern sensors has become a major obstacle to their use for manual detection of fishing vessels by fishing agencies and other organizations. In this regard, it was necessary to develop an algorithm and an automatic system for detecting night ship lights using satellite data and analyzing their distribution. This paper presents an algorithm for detecting night ship lights using satellite data from the VIIRS sensor, describes a software system that implements the developed algorithm, describes the developed methods and tools for analyzing the distribution of night ship lights, and presents the results of testing the developed methods.
    Date: 2019–03
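    The core of point-light detection at sea is flagging pixels whose radiance stands out from the dark local background. The toy sketch below applies a local-median threshold to an invented radiance grid; the paper's system works on actual VIIRS Day/Night Band scenes with far more elaborate filtering, so the grid, margin, and 3x3 window here are all illustrative choices.

```python
# Flag pixels exceeding the median of their 3x3 neighbourhood by a margin,
# a crude stand-in for detecting lit vessels against a dark sea background.

from statistics import median

def detect_lights(grid, margin=5.0):
    """Return (row, col) positions of locally bright pixels, row-major order."""
    hits = []
    rows, cols = len(grid), len(grid[0])
    for r in range(rows):
        for c in range(cols):
            neigh = [grid[rr][cc]
                     for rr in range(max(0, r - 1), min(rows, r + 2))
                     for cc in range(max(0, c - 1), min(cols, c + 2))
                     if (rr, cc) != (r, c)]
            if grid[r][c] > median(neigh) + margin:
                hits.append((r, c))
    return hits

sea = [[1, 1, 2, 1],
       [1, 9, 1, 1],   # bright spot at (1, 1): a lit vessel
       [2, 1, 1, 1],
       [1, 1, 1, 8]]   # bright spot at (3, 3)
print(detect_lights(sea))  # [(1, 1), (3, 3)]
```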
  14. By: Juliane Begenau; Maryam Farboodi; Laura Veldkamp
    Date: 2018
  15. By: Alexander Kindel (Princeton University); Vineet Bansal (Princeton University); Kristin Catena (Princeton University); Thomas Hartshorne (Princeton University); Kate Jaeger (Princeton University)
    Abstract: Researchers rely on metadata systems to prepare data for analysis. As the complexity of datasets increases and the breadth of data analysis practices grow, existing metadata systems can limit the efficiency and quality of data preparation. This article describes the redesign of a metadata system supporting the Fragile Families and Child Wellbeing Study based on the experiences of participants in the Fragile Families Challenge. We demonstrate how treating metadata as data—that is, releasing comprehensive information about variables in a format amenable to both automated and manual processing—can make the task of data preparation less arduous and less error-prone for all types of data analysis. We hope that our work will facilitate new applications of machine learning methods to longitudinal surveys and inspire research on data preparation in the social sciences. We have open-sourced the tools we created so that others can use and improve them.
    Keywords: metadata, survey research, data sharing, quantitative methodology, computational social science
    JEL: F13
    Date: 2018–10
  16. By: Montserrat Lopez-Cobo (European Commission - JRC); Giuditta De Prato (European Commission - JRC); Georgios Alaveras (European Commission - JRC); Riccardo Righi (European Commission - JRC); Sofia Samoili (European Commission - JRC); Jiri Hradec (European Commission - JRC); Lukasz Ziemba (European Commission - JRC); Katarzyna Pogorzelska (European Commission - JRC); Melisande Cardona (European Commission - JRC)
    Abstract: This study aims at supporting the policy initiatives to develop the availability in EC Member States of adequate advanced digital skills in a number of IT domains including Artificial Intelligence, High Performance Computing and Cybersecurity. By making use of the Techno-Economic Segments analytical approach developed under the PREDICT3 project (joint effort of EC JRC and DG CNECT), the study collects data and builds quantitative indicators to provide evidence based policy support. It addresses the mapping of digital skills in the mentioned technological domains from two complementary perspectives: the existing offer of academic courses (bachelor, master and doctoral programs), and the demand of profiles by industry as reflected by industry activity in the referred fields.
    Keywords: digital skills, industry demand, educational offer, artificial intelligence, cybersecurity, high performance computing, digital transformation
    Date: 2019–03
  17. By: Hills, Thomas; Illushka Seresinhe, Chanuki; Proto, Eugenio; Sgroi, Daniel
    Abstract: In addition to improving quality of life, higher subjective wellbeing leads to fewer health problems, higher productivity, and better incomes. For these reasons subjective wellbeing has become a key focal issue among scientific researchers and governments. Yet no scientific investigator knows how happy humans were in previous centuries. Here we show that a new method based on quantitative analysis of digitized text from millions of books published over the past 200 years captures reliable trends in historical subjective wellbeing across four nations. This method uses psychological valence norms for thousands of words to compute the relative proportion of positive and negative language, indicating relative happiness during national and international wars, financial crises, and in comparison to historical trends in longevity and GDP. We validate our method using Eurobarometer survey data from the 1970s onwards and in comparison with economic, medical, and political events since 1820 and also use a set of words with stable historical meanings to support our findings. Finally we show that our results are robust to the use of diverse corpora (including text derived from newspapers) and different word norms.
    Date: 2019–03
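    The valence-norm method can be illustrated in a few lines: score a body of text as the mean valence of the words that appear in a norms lexicon, and compare periods. The mini-lexicon and sentences below are invented; the paper uses published psychological valence norms for thousands of words over millions of digitized books.

```python
# Toy valence scoring: average the lexicon valence of the words in a text.
# VALENCE is a made-up mini-lexicon on a 1-9 scale (higher = more positive).

VALENCE = {"peace": 8.0, "prosperity": 7.5, "happy": 8.2,
           "war": 1.8, "crisis": 2.4, "famine": 1.5}

def text_valence(text):
    scores = [VALENCE[w] for w in text.lower().split() if w in VALENCE]
    return sum(scores) / len(scores) if scores else None

boom_year = "peace and prosperity made the nation happy"
bust_year = "war and famine deepened the crisis"
print(round(text_valence(boom_year), 2), round(text_valence(bust_year), 2))  # 7.9 1.9
```

Aggregating such scores by publication year yields the historical wellbeing series the paper validates against survey data.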

This nep-big issue is ©2019 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at . For comments please write to the director of NEP, Marco Novarese at <>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.