nep-big New Economics Papers
on Big Data
Issue of 2020‒04‒27
seventeen papers chosen by
Tom Coupé
University of Canterbury

  1. CoronaNet: A Dyadic Dataset of Government Responses to the COVID-19 Pandemic By Cheng, Cindy; Barcelo, Joan; Hartnett, Allison; Kubinec, Robert; Messerschmidt, Luca
  2. Artificial Intelligence against COVID-19: An Early Review By Naudé, Wim
  3. Political Networks across the Globe By Commander, Simon; Poupakis, Stavros
  4. Using the Eye of the Storm to Predict the Wave of Covid-19 UI Claims By Daniel Aaronson; Scott A. Brave; R. Andrew Butters; Daniel W. Sacks; Boyoung Seo
  5. Tracking Labor Market Developments during the COVID-19 Pandemic: A Preliminary Assessment By Tomaz Cajner; Leland Crane; Ryan Decker; Adrian Hamins-Puertolas; Christopher J. Kurz
  6. Technological Innovation and Discrimination in Household Finance By Adair Morse; Karen M. Pence
  7. The Use of Artificial Intelligence in Health Care: Liability Issues By Mélanie Bourassa Forcier; Lara Khoury; Nathalie Vézina
  8. The Semicircular Flow of the Data Economy and the Data Sharing Laffer curve By de Pedraza, Pablo; Vollbracht, Ian
  9. Modelling Non-stationary 'Big Data' By Jennifer Castle; Jurgen Doornik; David Hendry
  10. The evolution of inequality of opportunity in Germany: A machine learning approach By Brunori, Paolo; Neidhöfer, Guido
  11. Identifying Exogenous Monetary Policy Shocks from FOMC Transcripts By Nataliia Ostapenko
  12. The dynamics of non-performing loans during banking crises: a new database By Ari, Anil; Ratnovski, Lev; Chen, Sophia
  13. State, development and efficiency of digitalization in Bulgarian agriculture By Bachev, Hrabrin
  14. Expanding and Diversifying the Pool of Undergraduates who Study Economics: Insights from a New Introductory Course at Harvard By Amanda Bayer; Gregory Bruich; Raj Chetty; Andrew Housiaux
  15. Digitalization and Platforms in Agriculture: Organizations, Power Asymmetry, and Collective Action Solutions By Kenney, Martin; Serhan, Hiam; Trystram, Gilles
  16. Robust Discovery of Regression Models By Jennifer L. Castle; Jurgen A. Doornik; David F. Hendry
  17. The value of publicly available, textual and non-textual information for startup performance prediction By Kaiser, Ulrich; Kuhn, Johan M.

  1. By: Cheng, Cindy; Barcelo, Joan; Hartnett, Allison; Kubinec, Robert (Princeton University); Messerschmidt, Luca
    Abstract: As the COVID-19 pandemic spreads around the world, governments have implemented a broad set of policies to limit its spread. In this paper we present an initial release of a large hand-coded dataset of more than 4,500 separate policy announcements from governments around the world. These data are being made publicly available, in combination with other data that we have collected (including COVID-19 tests, cases, and deaths) as well as a number of country-level covariates. Given the speed of the COVID-19 outbreak, we will be releasing these data on a daily basis with a 5-day lag for record validity checking. In a truly global effort, our team comprises more than 190 research assistants across 18 time zones and makes use of cloud-based managerial and data collection technology in addition to machine learning coding of news sources. We analyze the dataset with a Bayesian time-varying ideal point model, showing the quick acceleration of harsher policies across countries beginning in mid-March and continuing to the present. While some relatively low-cost policies like task forces and health monitoring began early, countries generally adopted harsher measures within a narrow time window, suggesting strong policy diffusion effects.
    Date: 2020–04–12
  2. By: Naudé, Wim (RWTH Aachen University)
    Abstract: Artificial Intelligence (AI) is a potentially powerful tool in the fight against the COVID-19 pandemic. Since the outbreak of the pandemic, there has been a scramble to use AI. This article provides an early, and necessarily selective, review, discussing the contribution of AI to the fight against COVID-19, as well as the current constraints on these contributions. Six areas where AI can contribute to the fight against COVID-19 are discussed, namely i) early warnings and alerts, ii) tracking and prediction, iii) data dashboards, iv) diagnosis and prognosis, v) treatments and cures, and vi) social control. It is concluded that AI has not yet been impactful against COVID-19. Its use is hampered by a lack of data, and by too much data. Overcoming these constraints will require a careful balance between data privacy and public health, and rigorous human-AI interaction. It is unlikely that these will be addressed in time to be of much help during the present pandemic. In the meantime, extensive gathering of diagnostic data on who is infectious will be essential to save lives, train AI, and limit economic damages.
    Keywords: data science, health, Coronavirus, COVID-19, artificial intelligence, development, technology, innovation
    JEL: O32 O39 I19 O20
    Date: 2020–04
  3. By: Commander, Simon (IE Business School, Altura Partners); Poupakis, Stavros (University of Oxford)
    Abstract: Political networks are an important feature of the political and economic landscape of countries. Despite their ubiquity and significance, information on such networks has proven hard to collect due to a pervasive lack of transparency. However, with the advent of big data and artificial intelligence, major financial services institutions are now actively collating publicly available information on politically exposed persons and their networks. In this study, we use one such data set to show how network characteristics vary across political systems. We provide results from more than 150 countries and show how the format of the network tends to reflect the extent of democratisation of each country. We also outline further avenues for research using such data.
    Keywords: political networks, rent-seeking, democratic consolidation
    JEL: D72 H11 P26 P36 N44
    Date: 2020–03
  4. By: Daniel Aaronson; Scott A. Brave; R. Andrew Butters; Daniel W. Sacks; Boyoung Seo
    Abstract: We leverage an event-study research design focused on the seven costliest hurricanes to hit the US mainland since 2004 to identify the elasticity of unemployment insurance filings with respect to search intensity. Applying our elasticity estimate to the state-level Google Trends indexes for the topic “unemployment,” we show that out-of-sample forecasts made ahead of the official data releases for March 21 and 28 predicted to a large degree the extent of the Covid-19 related surge in the demand for unemployment insurance. In addition, we provide a robust assessment of the uncertainty surrounding these estimates and demonstrate their use within a broader forecasting framework for US economic activity.
    Keywords: Covid-19; Google trends; hurricanes; unemployment; unemployment insurance
    JEL: C53 H12 J65
    Date: 2020–04–07
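The forecasting step the abstract describes can be sketched in miniature: an elasticity estimated from hurricane events maps a jump in Google search intensity for "unemployment" into a forecast of UI claims. All numbers below are hypothetical, not the authors' estimates:

```python
import math

def forecast_claims(baseline_claims, index_old, index_new, elasticity):
    """Log-linear forecast: the percent (log) change in UI claims equals
    the elasticity times the percent (log) change in search intensity."""
    log_change = elasticity * (math.log(index_new) - math.log(index_old))
    return baseline_claims * math.exp(log_change)

# Hypothetical values: baseline weekly claims of 250,000, a search index
# that triples, and an elasticity of 0.9.
surge = forecast_claims(250_000, 20.0, 60.0, 0.9)
print(round(surge))
```

The real exercise fits the elasticity on hurricane-driven filing spikes and then applies it out-of-sample to state-level search indexes; this sketch only shows the mapping itself.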
  5. By: Tomaz Cajner; Leland Crane; Ryan Decker; Adrian Hamins-Puertolas; Christopher J. Kurz
    Abstract: Many traditional official statistics are not suitable for measuring high-frequency developments that evolve over the course of weeks, not months. In this paper, we track the labor market effects of the COVID-19 pandemic with weekly payroll employment series based on microdata from ADP. These data are available essentially in real-time, and allow us to track both aggregate and industry effects. Cumulative losses in paid employment through April 4 are currently estimated at 18 million; just during the two weeks between March 14 and March 28 the U.S. economy lost about 13 million paid jobs. For comparison, during the entire Great Recession less than 9 million private payroll employment jobs were lost. In the current crisis, the most affected sector is leisure and hospitality, which has so far lost or furloughed about 30 percent of employment, or roughly 4 million jobs.
    Keywords: Big data; Economic measurement; Labor market; COVID-19
    JEL: C53 C55 C81 J11 J20
    Date: 2020–04–16
  6. By: Adair Morse; Karen M. Pence
    Abstract: Technology has changed how discrimination manifests itself in financial services. Replacing human discretion with algorithms in decision-making roles reduces taste-based discrimination, and new modeling techniques have expanded access to financial services to households who were previously excluded from these markets. However, algorithms can exhibit bias from human involvement in the development process, and their opacity and complexity can facilitate statistical discrimination inconsistent with antidiscrimination laws in several aspects of financial services provision, including advertising, pricing, and credit-risk assessment. In this chapter, we provide a new amalgamation and analysis of these developments, identifying five gateways whereby technology induces discrimination to creep into financial services. We also consider how these technological changes in finance intersect with existing discrimination and data privacy laws, leading to our contribution of four frontlines of regulation. Our analysis concludes that the net effect of innovation in technological finance on discrimination is ambiguous and depends on the future choices made by policymakers, the courts, and firms.
    Keywords: Discrimination; Fair lending; Statistical discrimination; FinTech; Taste-based preferences; Algorithmic decision-making; Proxy variables; Big data
    JEL: G21 G28 O33
    Date: 2020–02–20
  7. By: Mélanie Bourassa Forcier; Lara Khoury; Nathalie Vézina
    Abstract: This paper explores Canadian liability concerns flowing from the integration of artificial intelligence (AI) in health care (HC) delivery. It argues that the current Canadian legal framework is sufficient, in most cases, to allow developers and users of AI technology to assess each stakeholder’s responsibility should the technology cause harm. Further, it inquires as to whether an alternative approach to existing liability regimes should be adopted in order to promote AI innovation based on recognized best practices which, in turn, could lead to increased use of AI technology.
    Keywords: AI, Digital Health, Law, Liability, Doctors, Hospitals, Companies
    Date: 2020–04–16
  8. By: de Pedraza, Pablo; Vollbracht, Ian
    Abstract: This paper presents a theoretical conceptualization of the data economy that motivates more access to data for scientific research. It defines the semicircular flow of the data economy as analogous to the traditional circular flow of the economy. Knowledge extraction from large, inter-connected data sets displays natural monopoly characteristics, which favours the emergence of oligopolistic data holders that generate and disclose the amount of knowledge that maximizes their profit. If monopoly theory holds, this level of knowledge is below the socially desirable amount because data holders have incentives to maintain their market power. The analogy is further developed to include data leakages, data sharing policies, merit and demerit knowledge, and knowledge injections. The paper then draws a data sharing Laffer curve that defines optimal data sharing as the point where the production of merit knowledge is maximized. The theoretical framework seems to describe many features of the data-intensive economy of today, in which large-scale data holders specialize in extracting knowledge from the data they hold. The conclusions support policies to enhance data sharing and/or enhanced user-centric data property rights that would facilitate data flows and increase merit knowledge generation up to the socially desirable amount.
    Keywords: Big data, Artificial Intelligence, Government
    JEL: H1 O38 P48
    Date: 2020
  9. By: Jennifer Castle; Jurgen Doornik; David Hendry
    Abstract: Seeking substantive relationships among vast numbers of spurious connections when modelling Big Data requires an appropriate approach. Big Data are useful if they can increase the probability that the data generation process is nested in the postulated model, increase the power of specification and mis-specification tests, and yet do not raise the chances of adventitious significance. Simply choosing the best-fitting equation or trying hundreds of empirical fits and selecting a preferred one – perhaps contradicted by others that go unreported – is not going to lead to a useful outcome. Wide-sense non-stationarity (including both distributional shifts and integrated data) must be taken into account. The paper discusses the use of principal components analysis to identify cointegrating relations as a route to handling that aspect of non-stationary big data, along with saturation to handle distributional shifts, and models the monthly UK unemployment rate, using both macroeconomic and Google Trends data, searching over 3,000 explanatory variables and yet identifying a parsimonious, well-specified and theoretically interpretable model specification.
    Keywords: Cointegration; Big Data; Model Selection; Outliers; Indicator Saturation; Autometrics
    JEL: C51 Q54
    Date: 2020–04–15
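The use of principal components to pick out cointegrating relations can be illustrated on simulated data: when two I(1) series share a common stochastic trend, the smallest-variance principal component approximately recovers the stationary combination in which the trend cancels. A self-contained sketch (simulated data, not the authors' code or estimates):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
trend = np.cumsum(rng.normal(size=n))          # common stochastic trend, I(1)
x = trend + rng.normal(scale=0.5, size=n)      # two series driven by it
y = 2.0 * trend + rng.normal(scale=0.5, size=n)

data = np.column_stack([x, y])
data -= data.mean(axis=0)

# Eigen-decomposition of the sample covariance; np.linalg.eigh returns
# eigenvalues in ascending order, so column 0 is the smallest-variance
# direction -- approximately the cointegrating combination (2x - y)/sqrt(5).
cov = data.T @ data / n
eigvals, eigvecs = np.linalg.eigh(cov)
combo = data @ eigvecs[:, 0]

# The combination should have far smaller variance than either raw series.
print(combo.var() < 0.01 * x.var())
```

In the paper this idea is one ingredient among several (saturation handles the distributional shifts that a pure trend-cancelling argument ignores).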
  10. By: Brunori, Paolo; Neidhöfer, Guido
    Abstract: We show that measures of inequality of opportunity (IOP) fully consistent with Roemer's (1998) theory of IOP can be straightforwardly estimated by adopting a machine learning approach, and we apply our novel method to analyse the development of IOP in Germany over the last three decades. To do so, we take advantage of information contained in 25 waves of the Socio-Economic Panel. Our analysis shows that in Germany IOP declined immediately after reunification, increased in the first decade of the century, and slightly declined again after 2010. Over the entire period, at the top of the distribution we consistently find individuals who resided in West Germany before the fall of the Berlin Wall, whose fathers had a high occupational position, and whose mothers had a high educational degree. Those who resided in East Germany in 1989 and have low-educated parents persistently rank at the bottom.
    Keywords: Inequality, Opportunity, SOEP, Germany
    JEL: D63 D30 D31
    Date: 2020
  11. By: Nataliia Ostapenko
    Abstract: I propose a new approach to identifying exogenous monetary policy shocks that requires neither priors on the underlying macroeconomic structure nor any observation of monetary policy actions. My approach entails directly estimating the unexpected changes in the federal funds rate as those which cannot be predicted from the Federal Open Market Committee's (FOMC) internal discussions. I employ deep learning and basic machine learning regressors to predict the effective federal funds rate from the FOMC's discussions without imposing any time-series structure. The result of the standard three-variable Structural Vector Autoregression (SVAR) with my new measure shows that economic activity and inflation decline in response to a monetary policy shock.
    Keywords: monetary policy, identification, shock, deep learning, FOMC, transcripts
    Date: 2020
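The identification step can be sketched in miniature: predict the policy rate from meeting text, and treat the prediction residual as the unexpected component (the shock). The toy texts, rates, and ridge penalty below are invented for illustration; the paper uses deep learning on actual FOMC transcripts.

```python
import numpy as np

# Toy FOMC-style meeting texts and hypothetical policy rates.
texts = [
    "inflation pressures rising tighten policy",
    "labor market weak ease policy support growth",
    "inflation stable hold rate steady",
    "growth slowing downside risks ease policy",
]
rates = np.array([5.25, 1.00, 3.00, 0.75])

# Bag-of-words features over the toy vocabulary.
vocab = sorted({w for t in texts for w in t.split()})
X = np.array([[t.split().count(w) for w in vocab] for t in texts], dtype=float)

# Ridge regression via the normal equations (alpha is an arbitrary penalty),
# predicting the demeaned rate from the text features.
alpha = 1.0
beta = np.linalg.solve(X.T @ X + alpha * np.eye(len(vocab)),
                       X.T @ (rates - rates.mean()))
predicted = X @ beta + rates.mean()

# Residual = the part of the rate the text does not explain: the "shock".
shocks = rates - predicted
print(shocks.shape)
```

The design choice mirrors the abstract: no time-series structure is imposed, so the shock series is defined purely as the unpredictable component of the rate given the discussion text.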
  12. By: Ari, Anil; Ratnovski, Lev; Chen, Sophia
    Abstract: This paper presents a new dataset on the dynamics of non-performing loans (NPLs) during 88 banking crises since 1990. The data show similarities across crises during NPL build-ups but less so during NPL resolutions. We find a close relationship between NPL problems—elevated and unresolved NPLs—and the severity of post-crisis recessions. A machine learning approach identifies a set of pre-crisis predictors of NPL problems related to weak macroeconomic, institutional, corporate, and banking sector conditions. Our findings suggest that reducing pre-crisis vulnerabilities and promptly addressing NPL problems during a crisis are important for post-crisis output recovery.
    Keywords: banking crises, crisis resolution, debt, non-performing loans, recessions
    JEL: E32 E44 G21 N10 N20
    Date: 2020–04
  13. By: Bachev, Hrabrin
    Abstract: Despite its great theoretical and practical importance, there is no comprehensive analysis of the state and evolution of digitalization in Bulgarian agriculture and rural areas. The goal of this study is to analyze the state, development, and efficiency of digitalization in the agrarian sphere in Bulgaria, identify major trends, compare the situation with other EU countries, identify main problems, and make recommendations for improving policies in the next programming period. The analysis finds that in recent years the access of Bulgarian households to the internet has improved considerably, and the number of persons using the internet to interact with public institutions and to trade goods and services has increased significantly. Nevertheless, Bulgaria lags well behind other EU members in the introduction of digital technologies in the economy and society, ranking among the last in the EU on the Digital Economy and Society Index (DESI). The extent of digitalization varies greatly across subsectors of agriculture, across farms of different juridical type and size, and across regions of the country. Most agricultural holdings are not familiar with digital agriculture, and only 14% apply modern digital technologies. Major obstacles to the introduction of digital technologies are the qualification of employees, the amount of investment required, unclear economic benefits, and data security. The main areas where state action is required are: support for supplementary training of the workforce; tax preferences for planning and digitalizing activities; incentives for young specialists; introduction of internationally recognized standardization and certification processes; adaptation of legislation in the area of data protection; and provision of reliable, high-speed networks.
    Keywords: digitalisation, agriculture, rural, Bulgaria, AKIS, EU CAP
    JEL: Q1 Q12 Q13 Q15 Q16 Q18
    Date: 2020–03
  14. By: Amanda Bayer; Gregory Bruich; Raj Chetty; Andrew Housiaux
    Abstract: There is widespread concern that economics does not attract as broad or diverse a pool of talent as it could. For example, less than one-third of undergraduates who receive degrees in economics are women, significantly lower than in math or statistics. This article presents a case study of a new introductory undergraduate course at Harvard, “Using Big Data to Solve Economic and Social Problems,” that enrolled 400 students, achieved nearly a 50-50 gender balance, and was among the highest-rated courses in the college. We first summarize the course’s content and pedagogical approach. We then illustrate how this approach differs from that taken in traditional courses by showing how canonical topics – income inequality, tax incidence, and adverse selection – are taught differently. Then, drawing upon students’ comments and prior research on effective teaching practices, we identify elements of the course’s approach that appear to underlie its success: connecting the material to students’ own experiences; teaching skills that have social and career value; and engaging students in scientific investigation of real-world problems. We conclude by discussing how these ideas for improving instruction in economics could be applied in other courses and tested empirically in future research.
    JEL: A2
    Date: 2020–04
  15. By: Kenney, Martin; Serhan, Hiam; Trystram, Gilles
    Abstract: Technologies such as digitally-equipped agricultural equipment, drones, image recognition, sensors, robots, and artificial intelligence are being rapidly adopted throughout the agrifood system. As a result, actors in the system are generating and using ever more data. While this is already contributing to greater productivity, efficiency, and resilience, for the most part this data has been siloed at its production sites, whether on the farm or at other nodes in the system. Sharing this data can create value at other nodes in the system by increasing transparency, traceability, and productivity. Ever greater connectivity allows the sharing of this data with actors at the same node in the value chain, e.g., farmer-to-farmer, or between different nodes in the value chain, e.g., farmer-to-equipment producer. The benefits of data sharing for efficiency, productivity, and sustainability are predicated upon the adoption of an online digital platform. The conundrum is that, as the intermediary, the owner of a successful platform acquires significant power in relation to the platform's sides. This paper identifies five types of platform business models and ownership arrangements and their benefits and drawbacks for the various actors in the agri-food system, in particular farmers. The types discussed are: 1) venture capital-financed startups; 2) existing agro-food industry firms, including equipment makers such as John Deere, agrochemical/seed conglomerates such as Bayer/Monsanto, and agricultural commodity traders such as ADM and Cargill; 3) agricultural cooperatives such as InVivo in France; 4) various specially formed consortia of diverse sets of agri-food system actors, including farmers; and 5) internet giants such as Amazon, Microsoft, and Google. The paper assesses the business models for each of these organizational forms. Finally, we describe the drawbacks each of these organizational forms has experienced as it attempts to secure adoption of its particular platform solution.
    Keywords: Digitization, Platform Economy, Agriculture, Agri-food systems, Cooperatives, Platforms
    JEL: Q1 Q13 L6 L66
    Date: 2020–04–23
  16. By: Jennifer L. Castle (Dept of Economics, Institute for New Economic Thinking at the Oxford Martin School and Magdalen College, University of Oxford); Jurgen A. Doornik (Dept of Economics, Institute for New Economic Thinking at the Oxford Martin School and Climate Econometrics, Nuffield College, University of Oxford); David F. Hendry (Dept of Economics, Institute for New Economic Thinking at the Oxford Martin School and Climate Econometrics, Nuffield College, University of Oxford)
    Abstract: Since complete and correct a priori specifications of models for observational data never exist, model selection is unavoidable in that context. The target of selection needs to be the process generating the data for the variables under analysis, while retaining the objective of the study, often a theory-based formulation. Successful selection requires robustness against many potential problems jointly, including outliers and shifts; omitted variables; incorrect distributional shape; non-stationarity; misspecified dynamics; and non-linearity, as well as inappropriate exogeneity assumptions. The aim is to seek parsimonious final representations that retain the relevant information, are well specified, encompass alternative models, and evaluate the validity of the study. Our approach to doing so inevitably leads to more candidate variables than observations, handled by iteratively switching between contracting and expanding multi-path searches, here programmed in Autometrics. We investigate the ability of indicator saturation to discriminate between measurement errors and outliers, between outliers and large observations arising from non-linear responses (illustrated by artificial data), and apparent outliers due to alternative distributional assumptions. We illustrate the approach by exploring empirical models of the Boston housing market and inflation for the UK (both tackling outliers and non-linearities that can distort other estimation methods). We re-analyze the ‘local instability’ in the robust method of least median of squares shown by Hettmansperger and Sheather (1992) using indicator saturation to explain their findings.
    Keywords: Model Selection; Robustness; Outliers; Location Shifts; Indicator Saturation; Autometrics.
    JEL: C51 C22
    Date: 2020–04–15
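Impulse-indicator saturation can be illustrated with a stripped-down split-half version: adding an impulse dummy for every observation in one half of the sample, with the model (here just a mean) estimated on the other half, is equivalent to flagging observations far from that mean; the halves are then swapped. The function name, threshold, and data below are illustrative, not the Autometrics implementation:

```python
import numpy as np

def iis_outliers(y, crit=2.5):
    """Split-half impulse-indicator saturation, stripped down: saturate each
    half of the sample with impulse dummies in turn, estimating the mean and
    scale from the other half, and retain indicators with |t| > crit."""
    n = len(y)
    flags = np.zeros(n, dtype=bool)
    first, second = np.arange(n // 2), np.arange(n // 2, n)
    for hold_out, estimate in ((first, second), (second, first)):
        mu = y[estimate].mean()
        s = y[estimate].std(ddof=1)
        flags[hold_out] = np.abs(y[hold_out] - mu) / s > crit
    return flags

rng = np.random.default_rng(1)
y = rng.normal(size=200)
y[50] += 8.0                       # inject a single large outlier
flags = iis_outliers(y)
print(flags[50], int(flags.sum()))
```

The paper's actual procedure handles full regression models, location shifts, and multi-path search; this sketch only conveys why saturation with more indicators than observations is feasible when the indicators are estimated in blocks.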
  17. By: Kaiser, Ulrich; Kuhn, Johan M.
    Abstract: Can publicly available, web-scraped data be used to identify promising business startups at an early stage? To answer this question, we use textual and non-textual information about the names of Danish firms and their addresses, as well as their business purpose statements (BPSs), supplemented by core accounting information along with founder and initial startup characteristics, to forecast the performance of newly started enterprises over a five-year horizon. The performance outcomes we consider are involuntary exit, above-average employment growth, a return on assets of above 20 percent, new patent applications, and participation in an innovation subsidy program. Our first key finding is that our models predict startup performance with high or very high accuracy, with the exception of high returns on assets, where predictive power remains poor. Our second key finding is that the data requirements for predicting performance outcomes with such accuracy are low. To forecast the two innovation-related performance outcomes well, we only need a set of variables derived from the BPS texts, while accurate prediction of startup survival and high employment growth requires the combination of (i) information derived from the names of the startups, (ii) data on elementary founder-related characteristics, and (iii) either variables describing the initial characteristics of the startup (to predict startup survival) or business purpose statement information (to predict high employment growth). These sets of variables are easily obtainable, since the underlying information is mandatory to report upon business registration. The substantial accuracy of our predictions for survival, employment growth, new patents, and participation in innovation subsidy programs indicates ample scope for algorithmic scoring models as an additional pillar of funding and innovation support decisions.
    Keywords: startup, performance, prediction, text as data
    JEL: L26 C53
    Date: 2020

This nep-big issue is ©2020 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found on the project's homepage. For comments, please write to the director of NEP, Marco Novarese. Put “NEP” in the subject line, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.