nep-big New Economics Papers
on Big Data
Issue of 2020‒01‒20
sixteen papers chosen by
Tom Coupé
University of Canterbury

  1. Which Night Lights Data Should we Use in Economics, and Where? By Gibson, John; Olivia, Susan; Boe-Gibson, Geua
  2. Completing the Market: Generating Shadow CDS Spreads by Machine Learning By Nan Hu; Jian Li; Alexis Meyer-Cirkel
  3. 151 Estrategias de Trading (151 Trading Strategies) By Zura Kakushadze; Juan Andrés Serur
  4. Merger Policy in Digital Markets: An Ex-Post Assessment By Elena Argentesi; Paolo Buccirossi; Emilio Calvano; Tomaso Duso; Alessia Marrazzo; Salvatore Nava
  5. Big Data on Vessel Traffic: Nowcasting Trade Flows in Real Time By Serkan Arslanalp; Marco Marini; Patrizia Tumbarello
  6. We predict conflict better than we thought! Taking time seriously when evaluating predictions in Binary-Time-Series-Cross-Section-Data By Çiflikli, Gökhan; Metternich, Nils W
  7. Regulation of AI Technologies in the Construction Industry By Vishnu Sivarudran Pillai; Kira Matus
  8. Artificial Intelligence in Health Care: A Report from the National Academy of Medicine By Michael E. Matheny; Danielle Whicher; Sonoo Thadaney Israni
  9. Debt Is Not Free By Marialuz Moreno Badia; Paulo Medas; Pranav Gupta; Yuan Xiang
  10. Adaptive Trees: a new approach to economic forecasting By Nicolas Woloszko
  11. The Morphology and Circuity of Walkable and Drivable Street Networks By Boeing, Geoff
  12. An empirical study of neural networks for trend detection in time series By Alexandre Miot; Gilles Drigout
  13. Cultural evolution of emotional expression in 50 years of song lyrics By Brand, Charlotte Olivia; Acerbi, Alberto; Mesoudi, Alex
  14. Straw Burning, PM2.5 and Death: Evidence from China By Guojun He; Tony Liu; Maigeng Zhou
  15. Análisis de Sentimiento Basado en el Informe de Percepciones de Negocios del Banco Central de Chile (Sentiment Analysis Based on the Business Perceptions Report of the Central Bank of Chile) By María del Pilar Cruz; Hugo Peralta; Bruno Ávila
  16. Street Network Models and Measures for Every U.S. City, County, Urbanized Area, Census Tract, and Zillow-Defined Neighborhood By Boeing, Geoff

  1. By: Gibson, John; Olivia, Susan; Boe-Gibson, Geua
    Abstract: Popular DMSP night lights data are flawed by blurring, top-coding, and lack of calibration. Yet newer and better VIIRS data are rarely used in economics. We compare these two data sources for predicting Indonesian GDP at the second sub-national level. DMSP data are a bad proxy for GDP outside of cities. The city lights-GDP relationship is twice as noisy using DMSP as using VIIRS. Spatial inequality is considerably understated with DMSP data. A Pareto adjustment to correct for top-coding in DMSP data has a modest effect but still understates spatial inequality and misses key features of economic activity in Jakarta.
    Keywords: Night lights; inequality; GDP; DMSP; VIIRS; Indonesia
    JEL: O15 R12
    Date: 2019–12–14
  2. By: Nan Hu; Jian Li; Alexis Meyer-Cirkel
    Abstract: We compare the predictive performance of a series of machine learning and traditional methods for monthly CDS spreads, using firms’ accounting-based, market-based and macroeconomic variables for the period 2006 to 2016. We find that ensemble machine learning methods (Bagging, Gradient Boosting and Random Forest) strongly outperform other estimators, and Bagging particularly stands out in terms of accuracy. Traditional credit risk models using OLS techniques have the lowest out-of-sample prediction accuracy. The results suggest that the non-linear machine learning methods, especially the ensemble methods, add considerable value to existing credit risk prediction accuracy and enable CDS shadow pricing for companies missing those securities.
    Date: 2019–12–27
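The comparison in the abstract above (bagged ensembles versus OLS, judged out of sample) can be illustrated with a minimal sketch. This is not the authors' pipeline: the data are synthetic, the step-shaped relation is a hypothetical stand-in for a non-linear credit relationship, and the base learner (a one-split regression tree) and the names `fit_ols`, `fit_stump`, and `fit_bagging` are assumptions made for illustration.

```python
import random

def fit_ols(xs, ys):
    """Ordinary least squares with one regressor and an intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def fit_stump(xs, ys):
    """One-split regression tree: choose the threshold with the lowest squared error."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical x values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        mean_l, mean_r = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - mean_l) ** 2 for y in left) + sum((y - mean_r) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, (pairs[i - 1][0] + pairs[i][0]) / 2, mean_l, mean_r)
    if best is None:  # every x in the sample is identical: predict the mean
        mean = sum(ys) / len(ys)
        return lambda x: mean
    _, threshold, mean_l, mean_r = best
    return lambda x: mean_l if x <= threshold else mean_r

def fit_bagging(xs, ys, n_models=25, seed=0):
    """Bagging: fit one stump per bootstrap resample and average their predictions."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(m(x) for m in models) / len(models)

def mse(model, xs, ys):
    """Mean squared prediction error."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def step(x):
    """Hypothetical non-linear target: jumps from 0 to 1 at x = 0.5."""
    return 1.0 if x > 0.5 else 0.0

# Train on one grid and evaluate on a shifted grid (out of sample).
train_x = [i / 40 for i in range(40)]
train_y = [step(x) for x in train_x]
test_x = [(i + 0.5) / 40 for i in range(40)]
test_y = [step(x) for x in test_x]

ols_mse = mse(fit_ols(train_x, train_y), test_x, test_y)
bag_mse = mse(fit_bagging(train_x, train_y), test_x, test_y)
print(f"OLS out-of-sample MSE:     {ols_mse:.4f}")
print(f"Bagging out-of-sample MSE: {bag_mse:.4f}")
```

On this toy relation the linear fit cannot follow the jump, while the averaged stumps can. The sketch only shows the mechanism and says nothing about the magnitudes reported in the paper.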
  3. By: Zura Kakushadze; Juan Andrés Serur
    Abstract: This book, which is in Spanish, provides detailed descriptions, including over 550 mathematical formulas, for over 150 trading strategies across a host of asset classes (and trading styles). This includes stocks, options, fixed income, futures, ETFs, indexes, commodities, foreign exchange, convertibles, structured assets, volatility (as an asset class), real estate, distressed assets, cash, cryptocurrencies, miscellany (such as weather, energy, inflation), global macro, infrastructure, and tax arbitrage. Some strategies are based on machine learning algorithms (such as artificial neural networks, Bayes, k-nearest neighbors). We also give: source code for illustrating out-of-sample backtesting with explanatory notes; around 2,000 bibliographic references; and over 900 glossary, acronym and math definitions. The presentation is intended to be descriptive and pedagogical.
    Date: 2019–11
  4. By: Elena Argentesi; Paolo Buccirossi; Emilio Calvano; Tomaso Duso; Alessia Marrazzo; Salvatore Nava
    Abstract: This paper presents a broad retrospective evaluation of mergers and merger decisions in the digital sector. We first discuss the most crucial features of digital markets, such as network effects, multi-sidedness, big data, and rapid innovation, that create important challenges for competition policy. We show that these features have been key determinants of the theories of harm in major merger cases in the past few years. We then analyse the characteristics of almost 300 acquisitions carried out by three major digital companies (Amazon, Facebook, and Google) between 2008 and 2018. We cluster target companies by their area of economic activity and show that they span a wide range of economic sectors. In most cases, their products and services appear to be complementary to those supplied by the acquirers. Moreover, target companies seem to be particularly young, being four years old or younger in nearly 60% of cases at the time of the acquisition. Finally, we examine two important merger cases, Facebook/Instagram and Google/Waze, providing a systematic assessment of the theories of harm considered by the UK competition authorities as well as evidence on the evolution of the market after the transactions were approved. We discuss whether the competition authorities performed complete and careful analyses to foresee the competitive consequences of the investigated mergers and whether a more effective merger control regime can be achieved within the current legal framework.
    Keywords: digital markets, mergers, network effects, big data, platforms, ex-post, antitrust
    JEL: L40 K21
    Date: 2019
  5. By: Serkan Arslanalp; Marco Marini; Patrizia Tumbarello
    Abstract: Vessel traffic data based on the Automatic Identification System (AIS) is a big data source for nowcasting trade activity in real time. Using Malta as a benchmark, we develop indicators of trade and maritime activity based on AIS-based port calls. We test the quality of these indicators by comparing them with official trade and maritime statistics. We show that, if the challenges associated with port call data are overcome through appropriate filtering techniques, these emerging “big data” on vessel traffic could allow statistical agencies to complement existing data sources on trade and introduce new statistics that are more timely (real time), offering an innovative way to measure trade activity. That, in turn, could facilitate faster detection of turning points in economic activity. The approach could be extended to create a real-time worldwide indicator of global trade activity.
    Date: 2019–12–13
  6. By: Çiflikli, Gökhan (London School of Economics and Political Science); Metternich, Nils W (University College London)
    Abstract: Efforts to predict civil war onset, its duration, and subsequent peace have dramatically increased. Nonetheless, by standard classification metrics the discipline seems to make little progress. Some remedy is promised by particular cross-validation strategies and machine learning tools, which increase accuracy rates substantively. However, in this research note we provide convincing evidence that the predictive performance of conflict models has been much better than previously assessed. We demonstrate that standard classification metrics for binary outcome data are prone to underestimate model performance in a binary-time-series-cross-section context. We argue for temporal residual based metrics to evaluate cross-validation efforts in binary-time-series-cross-section and test these in Monte Carlo experiments and existing empirical studies.
    Date: 2019–03–13
  7. By: Vishnu Sivarudran Pillai (Chief Economist for Asia Pacific, NATIXIS, Department of Public Policy, The Hong Kong University of Science and Technology); Kira Matus (Associate Professor for the Division of Social Science, Associate Professor for Division of Public Policy, Department of Economics & Institute for Emerging Market Studies, the Hong Kong University of Science and Technology)
    Abstract: The development of Artificial Intelligence (AI)-based technologies for the construction industry, though not as advanced as in some areas, is progressing. The degree of automation in construction is anticipated to eventually lead to humanoid robots and autonomous back loaders or cranes operating at construction sites. A highly automated construction industry is a medium-term prospect. Hence it is imperative to understand the regulatory gaps proactively, in order to support policy interventions that mitigate potential risks. Regulation of futuristic technologies like AI is challenging in sectors where there is a lack of adequate tacit and applied knowledge. AI regulation is further complicated by the sheer size of the construction industry, which is characterized by a broad spectrum of actors and activities. We propose a framework for understanding the inclusion of AI in the construction industry and for identifying risks and regulatory gaps by considering the diverse stakeholders and their risk perception.
    Date: 2019–05
  8. By: Michael E. Matheny; Danielle Whicher; Sonoo Thadaney Israni
    Abstract: The promise of artificial intelligence (AI) in health care offers substantial opportunities to improve patient and clinical team outcomes, reduce costs, and influence population health.
    Keywords: artificial intelligence, health care, patient, clinical, population health
  9. By: Marialuz Moreno Badia; Paulo Medas; Pranav Gupta; Yuan Xiang
    Abstract: With public debt soaring across the world, a growing concern is whether current debt levels are a harbinger of fiscal crises, thereby restricting the policy space in a downturn. The empirical evidence to date is, however, inconclusive, and the true cost of debt may be overstated if interest rates remain low. To shed light on this debate, this paper re-examines the importance of public debt as a leading indicator of fiscal crises using machine learning techniques to account for complex interactions previously ignored in the literature. We find that public debt is the most important predictor of crises, showing strong non-linearities. Moreover, beyond certain debt levels, the likelihood of crises increases sharply regardless of the interest-growth differential. Our analysis also reveals that the interactions of public debt with inflation and external imbalances can be as important as debt levels. These results, while not necessarily implying causality, show governments should be wary of high public debt even when borrowing costs seem low.
    Date: 2020–01–03
  10. By: Nicolas Woloszko
    Abstract: The present paper develops Adaptive Trees, a new machine learning approach specifically designed for economic forecasting. Economic forecasting is made difficult by economic complexity, which implies non-linearities (multiple interactions and discontinuities) and unknown structural changes (the continuous change in the distribution of economic variables). The forecast methodology aims at addressing these challenges. The algorithm is said to be “adaptive” insofar as it adapts to the quantity of structural change it detects in the economy by giving more weight to more recent observations. The performance of the algorithm in forecasting GDP growth 3 to 12 months ahead is assessed through simulations in pseudo-real-time for six major economies (USA, UK, Germany, France, Japan, Italy). On average, Adaptive Trees performs broadly on par with forecasts obtained from the OECD’s Indicator Model and generally better than a simple AR(1) benchmark model as well as Random Forests and Gradient Boosted Trees.
    Keywords: business cycles, concept drift, feature engineering, forecasting, GDP growth, interpretable AI, machine learning, short-term forecasts, structural change
    JEL: C01 C18 C23 C45 C53 C63 E37
    Date: 2020–01–16
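The idea of "giving more weight to more recent observations" in the abstract above can be sketched with exponentially decaying observation weights. This is not the Adaptive Trees algorithm: the paper ties the weighting to the amount of structural change it detects, whereas here the `decay` parameter is a hypothetical fixed input, and a weighted mean stands in for the tree-based forecaster. The growth series is made up.

```python
def recency_weights(n_obs, decay):
    """Normalised weights for observations ordered oldest to newest.

    decay = 0 gives equal weights; a larger decay concentrates weight on recent data.
    """
    if not 0.0 <= decay < 1.0:
        raise ValueError("decay must lie in [0, 1)")
    raw = [(1.0 - decay) ** (n_obs - 1 - t) for t in range(n_obs)]
    total = sum(raw)
    return [w / total for w in raw]

def weighted_mean_forecast(series, decay):
    """Forecast the next value as the recency-weighted mean of the history."""
    weights = recency_weights(len(series), decay)
    return sum(w * y for w, y in zip(weights, series))

# Made-up growth series with a level shift (a structural change) after 8 periods.
growth = [2.0] * 8 + [0.5] * 4

equal = weighted_mean_forecast(growth, decay=0.0)   # treats all history alike
recent = weighted_mean_forecast(growth, decay=0.5)  # discounts the old regime
print(f"equal weights:   {equal:.3f}")
print(f"recency weights: {recent:.3f}")
```

With equal weights the forecast sits between the two regimes. With decaying weights it moves toward the post-shift level, which is the behaviour an adaptive scheme is after when the data-generating process has changed.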
  11. By: Boeing, Geoff (Northeastern University)
    Abstract: Circuity, the ratio of network distances to straight-line distances, is an important measure of urban street network structure and transportation efficiency. Circuity results from a circulation network's configuration, planning, and underlying terrain. In turn, it impacts how humans use urban space for settlement and travel. Although past research has examined overall street network circuity, researchers have not studied the relative circuity of walkable versus drivable circulation networks. This study uses OpenStreetMap data to explore relative network circuity. We download walkable and drivable networks for 40 US cities using the OSMnx software, which we then use to simulate four million routes and analyze circuity to characterize network structure. We find that walking networks tend to allow for more direct routes than driving networks do in most cities: average driving circuity exceeds average walking circuity in all but four of the cities that exhibit statistically significant differences between network types. We discuss various reasons for this phenomenon, illustrated with case studies. Network circuity also varies substantially between different types of places. These findings underscore the value of using network-based distances and times rather than straight-line when studying urban travel and access. They also suggest the importance of differentiating between walkable and drivable circulation networks when modeling and characterizing urban street networks: although different modes' networks overlap in any given city, their relative structure and performance vary in most cities.
    Date: 2019–01–28
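The circuity definition in the abstract above (network distance divided by straight-line distance) can be made concrete with a small sketch. It does not use OSMnx or OpenStreetMap data: the route is a hypothetical list of (latitude, longitude) points, straight-line distance is approximated by the haversine great-circle formula on a spherical Earth, and the function names are chosen for illustration.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (spherical-Earth assumption)

def great_circle_m(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def route_length_m(points):
    """Length of a route given as consecutive (lat, lon) points along the network."""
    return sum(great_circle_m(*p, *q) for p, q in zip(points, points[1:]))

def circuity(points):
    """Network distance along the route divided by origin-destination straight-line distance."""
    straight = great_circle_m(*points[0], *points[-1])
    if straight == 0:
        raise ValueError("origin and destination coincide")
    return route_length_m(points) / straight

# Hypothetical L-shaped route around a block: east along one street, then north.
l_shaped = [(0.0, 0.0), (0.0, 0.01), (0.01, 0.01)]
direct = [(0.0, 0.0), (0.0, 0.01)]

print(f"L-shaped route circuity: {circuity(l_shaped):.4f}")
print(f"direct route circuity:   {circuity(direct):.4f}")
```

A circuity of 1 means the route is as direct as possible. The L-shaped route comes out close to the square root of 2, the familiar grid penalty for travelling along two sides of a square instead of its diagonal.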
  12. By: Alexandre Miot; Gilles Drigout
    Abstract: Detecting structure in noisy time series is a difficult task. One intuitive feature is the notion of trend. From theoretical hints and using simulated time series, we empirically investigate the efficiency of standard recurrent neural networks (RNNs) to detect trends. We show the overall superiority and versatility of certain standard RNN structures over various other estimators. These RNNs could be used as basic blocks to build more complex time series trend estimators.
    Date: 2019–12
  13. By: Brand, Charlotte Olivia (University of Exeter); Acerbi, Alberto; Mesoudi, Alex (University of Exeter)
    Abstract: The cultural dynamics of music have recently become a popular avenue of research in the field of cultural evolution, reflecting a growing interest in art and popular culture more generally. Just as biologists seek to explain population-level trends in genetic evolution in terms of micro-evolutionary processes such as selection, drift and migration, cultural evolutionists have sought to explain population-level cultural phenomena in terms of underlying social, psychological and demographic factors. Primary amongst these factors are learning biases, describing how cultural items are socially transmitted from person to person. As big datasets become more openly available and workable, and statistical modelling techniques become more powerful, efficient and user-friendly, describing population-level dynamics in terms of simple, individual-level learning biases is becoming more feasible. Here we test for the presence of learning biases in two large datasets of popular song lyrics dating from 1965 to 2015. We find some evidence of content bias, prestige bias and success bias in the proliferation of negative lyrics, and suggest that negative expression of emotions in music, and perhaps art generally, provides an avenue for people not only to process and express their own negative emotions, but also to benefit from the knowledge that prestigious others experience negative emotions similar to their own.
    Date: 2019–01–17
  14. By: Guojun He (Division of Social Science, Division of Environment and Sustainability, Department of Economics, The Hong Kong University of Science and Technology.); Tony Liu (Division of Social Science, The Hong Kong University of Science and Technology.); Maigeng Zhou (National Center for Chronic and Non-Communicable Disease Control and Prevention, Chinese Center for Disease Control and Prevention.)
    Abstract: This study uses satellite data to detect agricultural straw burning and estimates its impact on air pollution and health in China. We find that straw burning increases particulate matter pollution and causes people to die from cardio-respiratory diseases. Middle-aged and old people in rural areas are particularly sensitive to straw burning pollution. We estimate that a 10 μg/m³ increase in PM2.5 will increase mortality by 3.25%. Subsidizing the recycling of straw brings significant health benefits and is estimated to avert 21,400 premature deaths annually.
    Date: 2019–05
  15. By: María del Pilar Cruz; Hugo Peralta; Bruno Ávila
    Abstract: Using the texts of the Business Perceptions Report published quarterly by the Central Bank of Chile, we construct a numerical index that reflects the emotional feeling or tone contained in the documents. To construct the index, we use the Sentiment Analysis (SA), or opinion mining, methodology to extract the sentiment orientation of the documents, taking into account the positive or negative contextual polarity of their language. The results show that the IPN index has a high and significant correlation with various indices referring to business confidence and economic expectations in the medium term. The correlation with quantitative indicators of activity, such as GDP growth, consumption or investment, is lower but still significant. The main contribution of this work is the formulation of a Spanish-language dictionary for SA and the generation of a numerical index through the application of the Sentiment Analysis methodology.
    Date: 2020–01
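A dictionary-based tone index of the kind described in the abstract above can be sketched as follows. The abstract does not give the index formula or the dictionary entries, so everything specific here is an assumption: the word lists are tiny hypothetical English stand-ins for the authors' Spanish dictionary, the score is the common net-polarity ratio (positive minus negative, over their sum), and contextual polarity such as negation is ignored.

```python
import re

# Hypothetical mini-lexicon for illustration only, not the paper's dictionary.
POSITIVE = {"growth", "improved", "strong", "recovery"}
NEGATIVE = {"decline", "weak", "uncertainty", "losses"}

def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens (accented letters kept)."""
    return re.findall(r"[a-záéíóúñü]+", text.lower())

def sentiment_index(text):
    """Net polarity in [-1, 1]: (positive - negative) / (positive + negative).

    Returns 0.0 when the text contains no dictionary words.
    """
    tokens = tokenize(text)
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    if pos + neg == 0:
        return 0.0
    return (pos - neg) / (pos + neg)

report = "Sales improved and growth was strong despite some uncertainty"
print(sentiment_index(report))
```

Computing the score per report and ordering the results by publication date would give a time series that can be compared with confidence indicators, which is the kind of exercise the abstract describes.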
  16. By: Boeing, Geoff (Northeastern University)
    Abstract: OpenStreetMap provides a valuable crowd-sourced database of raw geospatial data for constructing models of urban street networks for scientific analysis. This paper reports results from a research project that collected raw street network data from OpenStreetMap using the Python-based OSMnx software for every U.S. city and town, county, urbanized area, census tract, and Zillow-defined neighborhood. It constructed nonplanar directed multigraphs for each and analyzed their structural and morphological characteristics. The resulting data repository contains over 110,000 processed, cleaned street network graphs (which in turn comprise over 55 million nodes and over 137 million edges) at various scales—comprehensively covering the entire U.S.—archived as reusable open-source GraphML files, node/edge lists, and GIS shapefiles that can be immediately loaded and analyzed in standard tools such as ArcGIS, QGIS, NetworkX, graph-tool, igraph, or Gephi. The repository also contains measures of each network’s metric and topological characteristics common in urban design, transportation planning, civil engineering, and network science. No other such dataset exists. These data offer researchers and practitioners a new ability to quickly and easily conduct graph-theoretic circulation network analysis anywhere in the U.S. using standard, free, open-source tools.
    Date: 2019–03–01

This nep-big issue is ©2020 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
For comments, please write to the director of NEP, Marco Novarese. Put “NEP” in the subject line; otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.