nep-big New Economics Papers
on Big Data
Issue of 2020‒08‒10
eighteen papers chosen by
Tom Coupé
University of Canterbury

  1. Machine learning classification of entrepreneurs in British historical census data By Montebruno, Piero; Bennett, Robert; Smith, Harry; van Lieshout, Carry
  2. A Study on the Impact of Artificial Intelligence on Project Management By Belharet, Adel; Bharathan, Urmila; Dzingina, Benjamin; Madhavan, Neha; Mathur, Charul; Toti, Yves-Daniel Boga; Babbar, Divij; Markowski, Krzysztof
  3. A Note on the Interpretability of Machine Learning Algorithms By Dominique Guégan
  4. Serie de Machine Learning. Revisión de Algebra Lineal 1 By Sergio A. Pernice
  5. Services Trade Policies and Economic Integration: New Evidence for Developing Countries By Hoekman, Bernard; Shepherd, Ben
  6. Night Lights in Economics: Sources and Uses By John Gibson; Susan Olivia; Geua Boe-Gibson
  7. Forecasting Singapore GDP using the SPF data By Xie, Tian; Yu, Jun
  8. AI Watch 2019 Activity Report By Blagoj Delipetrev; Chrisa Tsinaraki; Daniel Nepelski; Emilia Gomez Gutierrez; Fernando Martinez Plumed; Gianluca Misuraca; Giuditta De Prato; Karen Fullerton; Massimo Craglia; Nestor Duch-Brown; Stefano Nativi; Vincent Van Roy
  9. Debt Is Not Free By Marialuz Moreno Badia; Paulo Medas; Pranav Gupta; Yuan Xiang
  10. Selling the circularity: Investigating the impact of circularity promotion on the performance of Italian manufacturing companies By Silvia Blasi; Benedetta Crisafulli; Silvia Rita Sedita
  11. Valuing Private Equity Strip by Strip By Gupta, Arpit; van Nieuwerburgh, Stijn
  12. Investment sizing with deep learning prediction uncertainties for high-frequency Eurodollar futures trading By Trent Spears; Stefan Zohren; Stephen Roberts
  13. How Do Member Countries Receive IMF Policy Advice: Results from a State-of-the-art Sentiment Index By Ghada Fayad; Chengyu Huang; Yoko Shibuya; Peng Zhao
  14. Illuminating Economic Growth By Yingyao Hu; Jiaxiong Yao
  15. Better Night Lights Data, For Longer By John Gibson
  16. The unintended impact of Colombia's covid-19 lockdown on forest fires By Amador-Jiménez, Mónica; Millner, Naomi; Palmer, Charles; Pennington, R. Toby; Sileci, Lorenzo
  17. Industrial pattern and robot adoption in European regions By Massimiliano Nuccio; Marco Guerzoni; Riccardo Cappelli; Aldo Geuna
  18. How Unequal is Europe? Evidence from Distributional National Accounts, 1980-2017 By Thomas Blanchet; Lucas Chancel; Amory Gethin

  1. By: Montebruno, Piero; Bennett, Robert; Smith, Harry; van Lieshout, Carry
    Abstract: This paper presents a binary classification of entrepreneurs in British historical data based on the recent availability of big data from the I-CeM dataset. The main task of the paper is to attribute an employment status to individuals that did not fully report entrepreneur status in earlier censuses (1851-1881). The paper assesses the accuracy of different classifiers and machine learning algorithms, including Deep Learning, for this classification problem. We first adopt a ground-truth dataset from the later censuses to train the computer with a Logistic Regression (which is standard in the literature for this kind of binary classification) to recognize entrepreneurs distinct from non-entrepreneurs (i.e. workers). Our initial accuracy for this base-line method is 0.74. We compare the Logistic Regression with ten optimized machine learning algorithms: Nearest Neighbors, Linear and Radial Support Vector Machine, Gaussian Process, Decision Tree, Random Forest, Neural Network, AdaBoost, Naive Bayes, and Quadratic Discriminant Analysis. The best results are boosting and ensemble methods. AdaBoost achieves an accuracy of 0.95. Deep-Learning, as a standalone category of algorithms, further improves accuracy to 0.96 without using the rich text-data that characterizes the OccString feature, a string of up to 500 characters with the full occupational statement of each individual collected in the earlier censuses. Finally, and now using this OccString feature, we implement both shallow (bag-of-words algorithm) learning and Deep Learning (Recurrent Neural Network with a Long Short-Term Memory layer) algorithms. These methods all achieve accuracies above 0.99 with Deep Learning Recurrent Neural Network as the best model with an accuracy of 0.9978. The results show that standard algorithms for classification can be outperformed by machine learning algorithms. 
    This confirms the value of extending the techniques traditionally used in the literature for this type of classification problem.
    Keywords: machine learning; deep learning; logistic regression; classification; big data; census
    JEL: M13 N83
    Date: 2019–08–02
  2. By: Belharet, Adel (Korea University); Bharathan, Urmila; Dzingina, Benjamin; Madhavan, Neha; Mathur, Charul; Toti, Yves-Daniel Boga; Babbar, Divij; Markowski, Krzysztof
    Abstract: Artificial intelligence and machine learning have found a wide range of business applications, but their impact is only just starting to be seen in project management. This study explores how our existing PM profession will change to be more suitable to AI inputs; and how project management will be forced to change because of the advent of AI, along with concrete, succinct and precise recommendations backed by demonstrable reasoning.
    Date: 2020–06–29
  3. By: Dominique Guégan (Department of Economics, University Of Venice Cà Foscari; University Paris 1 Panthéon-Sorbonne; labEx ReFi Paris)
    Abstract: We are interested in the analysis of the concept of interpretability associated with a ML algorithm. We distinguish between the “How”, i.e., how a black box or a very complex algorithm works, and the “Why”, i.e., why an algorithm produces such a result. These questions appeal to many actors, users, professions, regulators among others. Using a formal standardized framework, we indicate the solutions that exist by specifying which elements of the supply chain are impacted when we provide answers to the previous questions. This presentation, by standardizing the notations, makes it possible to compare the different approaches and to highlight the specificities of each of them: both their objective and their process. The study is not exhaustive and the subject is far from being closed.
    Keywords: Agnostic models, Artificial Intelligence, Counterfactual approach, Interpretability, LIME method, Machine learning
    JEL: C K
    Date: 2020
  4. By: Sergio A. Pernice
    Abstract: In this document we present a first review of linear algebra in a form specially adapted to its eventual applications in machine learning. It is the first in a series of documents on machine learning in Spanish. It is part of the content of the course “Métodos de Machine Learning para Economistas” (Machine Learning Methods for Economists) of the Master's in Economics at UCEMA.
    Keywords: linear algebra, regressions, machine learning
    Date: 2020–07
  5. By: Hoekman, Bernard; Shepherd, Ben
    Abstract: This paper provides the first quantitative evidence on the restrictiveness of services policies in 2016 for a sample of developing countries, based on recently released regulatory data collected by the World Bank and WTO. We use machine learning to recreate to a high degree of accuracy the OECD's Services Trade Restrictiveness Index (STRI), which takes account of nonlinearities and dependencies across measures. We use the resulting estimates to extend the OECD STRI approach to 23 additional countries, producing what we term a Services Policy Index (SPI). Converting the SPI to ad valorem equivalent terms shows that services policies are typically much more restrictive than tariffs on imports of goods, in particular in professional services and telecommunications. Developing countries tend to have higher services trade restrictions, but less so than has been found in research using data for the late 2000s. We show that the SPI has strong explanatory power for bilateral trade in services at the sectoral level, as well as for aggregate goods and services trade.
    Keywords: international trade; Machine Learning; restrictiveness indicators; services policies; Trade in Services
    JEL: F13 F15 O24
    Date: 2019–12
  6. By: John Gibson; Susan Olivia; Geua Boe-Gibson
    Abstract: Night lights, as detected by satellites, are increasingly used by economists, typically as a proxy for economic activity. The growing popularity of these data reflects either the absence, or the presumed inaccuracy, of more conventional economic statistics, like national or regional GDP. Further growth in use of night lights is likely, as they have been included in the AidData geo-query tool for providing sub-national data, and in geographic data that the Demographic and Health Survey links to anonymised survey enumeration areas. Yet this ease of obtaining night lights data may lead to inappropriate use, if users fail to recognize that most of the satellites providing these data were not designed to assist economists, and have features that may threaten validity of analyses based on these data, especially for temporal comparisons, and for small and rural areas. In this paper we review sources of satellite data on night lights, discuss issues with these data, and survey some of their uses in economics.
    Keywords: Density; Development; DMSP; Luminosity; Night lights; VIIRS
    JEL: O15 R12
    Date: 2020
  7. By: Xie, Tian (Shanghai University of Finance and Economics); Yu, Jun (School of Economics, Singapore Management University)
    Abstract: In this article, we use econometric methods, machine learning methods, and a hybrid method to forecast the GDP growth rate in Singapore based on the Survey of Professional Forecasters (SPF). We compare the performance of these methods with the sample median used by the Monetary Authority of Singapore (MAS). It is shown that the relationship between the actual GDP growth rates and the forecasts from individual professionals is highly nonlinear and non-additive, making it hard for all linear methods and the sample median to perform well. It is found that the hybrid method performs the best, reducing the mean squared forecast error (MSFE) by about 50% relative to that of the sample median.
    Date: 2020–07–14
  8. By: Blagoj Delipetrev (European Commission - JRC); Chrisa Tsinaraki (European Commission - JRC); Daniel Nepelski (European Commission - JRC); Emilia Gomez Gutierrez (European Commission - JRC); Fernando Martinez Plumed (European Commission - JRC); Gianluca Misuraca (European Commission - JRC); Giuditta De Prato (European Commission - JRC); Karen Fullerton (European Commission - JRC); Massimo Craglia (European Commission - JRC); Nestor Duch-Brown (European Commission - JRC); Stefano Nativi (European Commission - JRC); Vincent Van Roy (European Commission - JRC)
    Abstract: This report provides an overview of AI Watch activities in 2019. AI Watch is the European Commission knowledge service to monitor the development, uptake and impact of Artificial Intelligence (AI) for Europe. As part of the European strategy on AI, the European Commission and the Member States published in December 2018 a "Coordinated Plan on Artificial Intelligence" on the development of AI in the EU. The Coordinated Plan mentions the role of AI Watch to monitor its implementation. AI Watch was launched in December 2018. It aims to monitor European Union's industrial, technological and research capacity in AI; AI national strategies and policy initiatives in the EU Member States; uptake and technical developments of AI; and AI use and impact in public services. AI Watch will also provide analyses of education and skills for AI; AI key technological enablers; data ecosystems; and social perspective on AI. AI Watch has a European focus within the global landscape, and works in coordination with Member States. In its first year AI Watch has developed and proposed methodologies for data collection and analysis in a wide scope of AI-impacted domains, and has presented new results that can already support policy making on AI in the EU. In the coming months AI Watch will continue collecting and analysing new information. All AI Watch results and analyses are published on the AI Watch public web portal. AI Watch welcomes feedback. This report will be updated annually.
    Keywords: AI, AI_Watch, Artificial Intelligence, Observatory, Research, Innovation, Digital Economy, Digital Transformation, Robotics, Digital Skills, Digital Education, Public Services, Member States, National AI strategies.
    Date: 2020–07
  9. By: Marialuz Moreno Badia; Paulo Medas; Pranav Gupta; Yuan Xiang
    Abstract: With public debt soaring across the world, a growing concern is whether current debt levels are a harbinger of fiscal crises, thereby restricting the policy space in a downturn. The empirical evidence to date is however inconclusive, and the true cost of debt may be overstated if interest rates remain low. To shed light on this debate, this paper re-examines the importance of public debt as a leading indicator of fiscal crises using machine learning techniques to account for complex interactions previously ignored in the literature. We find that public debt is the most important predictor of crises, showing strong non-linearities. Moreover, beyond certain debt levels, the likelihood of crises increases sharply regardless of the interest-growth differential. Our analysis also reveals that the interactions of public debt with inflation and external imbalances can be as important as debt levels. These results, while not necessarily implying causality, show governments should be wary of high public debt even when borrowing costs seem low.
    Keywords: Domestic debt; Financial statistics; Public debt; Negative interest rates; Economic analysis; crisis; debt; default; fiscal; machine learning; WP; fiscal crisis; predictor; income group; debt level; Reinhart
    Date: 2020–01–03
  10. By: Silvia Blasi (Department of Economics and Management, University of Padova); Benedetta Crisafulli (Department of Management, Birkbeck, University of London); Silvia Rita Sedita (Department of Economics and Management, University of Padova)
    Abstract: Promoting the circularity of business practices and of product offerings represents a pivotal process in increasing the value of circular products and encouraging the market to recognize such a value. This study investigates the communication abilities of companies manifesting an interest in adopting circular economy practices, with the aim to assess the extent to which promoting circularity increases economic performance. Employing a unique web-scraped dataset of Italian circular companies’ websites, we captured and analyzed the online promotional efforts of a unique sample of manufacturing companies. Underpinned by the signaling theory, our estimation results illustrate that the ability of small and medium-sized enterprises (SMEs) to signal the circularity of their business practices on the website generally increases performance and such impact is larger among low performing companies. Our study advances knowledge on: 1) the impact of promoting circularity on economic performance, 2) the efficacy of signaling in the context of circular practices’ adoption.
    Keywords: circular economy, big data, web scraping, signaling, communication, sustainability
    JEL: M10 M31
    Date: 2020–07
  11. By: Gupta, Arpit; van Nieuwerburgh, Stijn
    Abstract: We propose a new valuation method for private equity investments. First, we construct a cash-flow replicating portfolio for the private investment, applying Machine Learning techniques on cash-flows on various listed equity and fixed income instruments. The second step values the replicating portfolio using a flexible asset pricing model that accurately prices the systematic risk in bonds of different maturities and a broad cross-section of equity factors. The method delivers a measure of the risk-adjusted profit earned on a PE investment and a time series for the expected return on PE fund categories. We apply the method to buyout, venture capital, real estate, and infrastructure funds, among others. Accounting for horizon-dependent risk and exposure to a broad cross-section of equity factors results in negative average risk-adjusted profits. Substantial cross-sectional variation and persistence in performance suggests some funds outperform. We also find declining expected returns on PE funds in the later part of the sample.
    Keywords: affine asset pricing models; Buyout; cross-section of returns; infrastructure; Natural resources; private equity; real estate; temporal pricing of risk; Valuation; venture capital
    JEL: G12 G24
    Date: 2019–12
  12. By: Trent Spears; Stefan Zohren; Stephen Roberts
    Abstract: In this work we show that prediction uncertainty estimates gleaned from deep learning models can be useful inputs for influencing the relative allocation of risk capital across trades. In this way, consideration of uncertainty is important because it permits the scaling of investment size across trade opportunities in a principled and data-driven way. We showcase this insight with a prediction model and find clear outperformance based on a Sharpe ratio metric, relative to trading strategies that either do not take uncertainty into account, or that utilize an alternative market-based statistic as a proxy for uncertainty. Of added novelty is our modelling of high-frequency data at the top level of the Eurodollar Futures limit order book for each trading day of 2018, whereby we predict interest rate curve changes on small time horizons. We are motivated to study the market for these popularly-traded interest rate derivatives since it is deep and liquid, and contributes to the efficient functioning of global finance -- though there is relatively little by way of its modelling contained in the academic literature. Hence, we verify the utility of prediction models and uncertainty estimates for trading applications in this complex and multi-dimensional asset price space.
    Date: 2020–07
  13. By: Ghada Fayad; Chengyu Huang; Yoko Shibuya; Peng Zhao
    Abstract: This paper applies state-of-the-art deep learning techniques to develop the first sentiment index measuring member countries’ reception of IMF policy advice at the time of Article IV Consultations. This paper finds that while authorities of member countries largely agree with Fund advice, there is variation across country size, external openness, policy sectors and their assessed riskiness, political systems, and commodity export intensity. The paper also looks at how sentiment changes during and after a financial arrangement or program with the Fund, as well as when a country receives IMF technical assistance. The results shed light on key aspects on Fund surveillance while redefining how the IMF can view its relevance, value added, and traction with its member countries.
    Keywords: Fiscal sector; External sector; Economic conditions; Real sector; Commodity price indexes; IMF; Surveillance; Economic Policy; Sentiment Analysis; Natural Language Processing; WP; article IV; article IV consultation; paragraph; sector-specific
    Date: 2020–01–17
  14. By: Yingyao Hu; Jiaxiong Yao
    Abstract: This paper seeks to illuminate the uncertainty in official GDP per capita measures using auxiliary data. Using satellite-recorded nighttime lights as an additional measurement of true GDP per capita, we provide a statistical framework, in which the error in official GDP per capita may depend on the country’s statistical capacity and the relationship between nighttime lights and true GDP per capita can be nonlinear and vary with geographic location. This paper uses recently developed results for measurement error models to identify and estimate the nonlinear relationship between nighttime lights and true GDP per capita and the nonparametric distribution of errors in official GDP per capita data. We then construct more precise and robust measures of GDP per capita using nighttime lights, official national accounts data, statistical capacity, and geographic locations. We find that GDP per capita measures are less precise for middle and low income countries and nighttime lights can play a bigger role in improving such measures.
    Keywords: Low income countries; Economic growth; Development; Emerging markets; Technological innovation; Nighttime lights; measurement error; GDP per capita; real GDP; optimal weight; income country; official measure; middle income country
    Date: 2019–04–09
  15. By: John Gibson
    Abstract: Night lights data are increasingly used in applied economics, almost always from the Defense Meteorological Satellite Program (DMSP). These data are old, with production ending in 2013, and are flawed by blurring, lack of calibration, and top-coding. These inaccuracies in DMSP data cause mean-reverting errors. This paper shows newer and better VIIRS night lights data have 80% higher predictive power for real GDP in a cross-section of almost 300 European NUTS2 regions. Spatial inequality is greatly understated with DMSP data, especially for the most densely populated regions. A Pareto correction for top-coding of DMSP data has a modest effect.
    Date: 2020
  16. By: Amador-Jiménez, Mónica; Millner, Naomi; Palmer, Charles; Pennington, R. Toby; Sileci, Lorenzo
    Abstract: The covid-19 pandemic led to rapid and large-scale government intervention in economies and societies. A common policy response to covid-19 outbreaks has been the lockdown or quarantine. Designed to slow the spread of the disease, lockdowns have unintended consequences for the environment. This article examines the impact of Colombia’s lockdown on forest fires, motivated by satellite data showing a particularly large upsurge of fires at around the time of lockdown implementation. We find that Colombia’s lockdown is associated with an increase in forest fires compared to three different counterfactuals, constructed to simulate the expected number of fires in the absence of the lockdown. To varying degrees across Colombia’s regions, the presence of armed groups is correlated with this fire upsurge. Mechanisms through which the lockdown might influence fire rates are discussed, including the mobilisation of armed groups and the reduction in the monitoring capacity of state and conservation organisations during the covid-19 outbreak. Given the fast-developing situation in Colombia, we conclude with some ideas for further research.
    Keywords: Covid-19; coronavirus; armed groups; Colombia; deforestation; forest fires; lockdown
    JEL: Q23 Q56 Q58
    Date: 2020–07
  17. By: Massimiliano Nuccio (BLISS – Digital Impact Lab, Department of Management, Università Ca' Foscari Venice); Marco Guerzoni (DESPINA Big Data Lab, Department of Economics and Statistics Cognetti De Martiis, University of Torino); Riccardo Cappelli (Department of Economics and Social Sciences, Polytechnic University of Marche); Aldo Geuna (Department of Culture, Politics and Society, University of Torino)
    Abstract: Recent literature on the diffusion of robots mostly ignores the regional dimension. The contribution of this paper to the debate on Industry 4.0 is twofold. First, IFR (2017) data on acquisitions of industrial robots in the five largest European economies are rescaled at regional levels to draw a first picture of winners and losers in the European race for advanced manufacturing. Second, using an unsupervised machine learning approach to classify regions based on their composition of industries, the paper provides novel evidence of the relationship between industry mix and the regional capability of adopting robots in industrial processes.
    Keywords: Robots, Industry 4.0, Innovation, Industry Mix, Self-Organizing Maps
    JEL: E32 O33 R11 R12
    Date: 2020–07
  18. By: Thomas Blanchet (PSE - Paris School of Economics, WIL - World Inequality Lab); Lucas Chancel (PSE - Paris School of Economics, WIL - World Inequality Lab, IDDRI - Institut du Développement Durable et des Relations Internationales - Institut d'Études Politiques [IEP] - Paris); Amory Gethin (PSE - Paris School of Economics, WIL - World Inequality Lab)
    Abstract: This paper estimates the evolution of income inequality in 38 European countries from 1980 to 2017 by combining surveys, tax data and national accounts. We develop a harmonized methodology, using machine learning, nonlinear survey calibration and extreme value theory, in order to produce homogeneous pre-tax and post-tax income inequality estimates, comparable across countries and consistent with official national income growth rates. Inequalities have increased in a majority of European countries, both at the top and at the bottom of the distribution, especially between 1980 and 2000. The European top 1% grew more than two times faster than the bottom 50% and captured 17% of regional income growth. Relative poverty in Europe went through ups and downs, increasing from 20% in 1980 to 22% in 2017. Inequalities yet remain lower and have increased much less in Europe than in the US, despite the persistence of strong income differences between European countries and the weaker progressivity of European-wide income redistribution.
    Keywords: Simplified Distributional National Accounts,DINA,distribution,Inequality,Europe,pre-tax income,post-tax income,national income
    Date: 2019

This nep-big issue is ©2020 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
For comments please write to the director of NEP, Marco Novarese. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.