nep-big New Economics Papers
on Big Data
Issue of 2019‒02‒18
seventeen papers chosen by
Tom Coupé
University of Canterbury

  1. The added value of more accurate predictions for school rankings By Fritz Schiltz; Paolo Sestito; Tommaso Agasisti; Kristof De Witte
  2. Deciphering Monetary Policy Board Minutes through Text Mining Approach: The Case of Korea By Ki Young Park; Youngjoon Lee; Soohyon Kim
  3. Canada’s Monetary Policy Report: If Text Could Speak, What Would It Say? By André Binette; Dmitri Tchebotarev
  4. Linked Inventor Biography Data 1980-2014 : (INV-BIO ADIAB 8014) By Dorner, Matthias; Harhoff, Dietmar; Gaessler, Fabian; Hoisl, Karin; Poege, Felix
  5. Simultaneous inference for Best Linear Predictor of the Conditional Average Treatment Effect and other structural functions By Victor Chernozhukov; Vira Semenova
  6. Investigating the Enablers of Big Data Analytics on Sustainable Supply Chain By Lineth Rodríguez; Mihalis Giannakis; Catherine Da Cunha
  7. Machine learning in the service of policy targeting: the case of public credit guarantees By Monica Andini; Michela Boldrini; Emanuele Ciani; Guido de Blasio; Alessio D'Ignazio; Andrea Paladini
  8. Should We Care (More) About Data Aggregation? Evidence from the Democracy-Growth-Nexus. By Klaus Gründler; Tommy Krieger
  9. The Hardware-Software Model: A New Conceptual Framework of Production, R&D, and Growth with AI By Jakub Growiec
  10. What predicts corruption? By Colonnelli, E; Gallego, J.A.; Prem, M
  11. Can one improve now-casts of crop prices in Africa? Google can. By Weber, Regine; Lukas, Kornher
  12. High-performance stock index trading: making effective use of a deep LSTM neural network By Chariton Chalvatzis; Dimitrios Hristu-Varsakelis
  13. Countries’ perceptions of China’s Belt and Road Initiative: A big data analysis By Alicia García-Herrero; Jianwei Xu
  14. Big Data and Firm Dynamics By Maryam Farboodi; Roxana Mihet; Thomas Philippon; Laura Veldkamp
  15. Digital transformation and finance sector competition By Santiago Fernández de Lis; Pablo Urbiola
  16. Risk management with machine-learning-based algorithms By Simon Fecamp; Joseph Mikael; Xavier Warin
  17. A Horse Race in High Dimensional Space By Paolo Andreini; Donato Ceci

  1. By: Fritz Schiltz (University of Leuven); Paolo Sestito (Bank of Italy); Tommaso Agasisti (Politecnico di Milano); Kristof De Witte (University of Leuven, University of Maastricht)
    Abstract: School rankings based on value-added (VA) estimates are subject to prediction errors, since VA is defined as the difference between predicted and actual performance. We introduce a more flexible random forest (RF), rooted in the machine learning literature, to minimize prediction errors and to improve school rankings. Monte Carlo simulations demonstrate the advantages of this approach. Applying the proposed method to data on Italian middle schools indicates that school rankings are sensitive to prediction errors, even when extensive controls are added. RF estimates provide a low-cost way to increase the accuracy of predictions, resulting in more informative rankings, and better policies.
    Keywords: value-added, school rankings, machine learning, Monte Carlo
    JEL: I21 C50
    Date: 2019–02
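The mechanics behind value-added rankings can be sketched in a few lines. This is an illustrative toy, not the authors' code: the school labels and scores are invented, and the predicted scores stand in for the output of a fitted model such as the random forest the paper proposes.

```python
# Toy illustration: value-added (VA) is the gap between a school's actual
# score and the score a model predicts from its intake; schools are then
# ranked on VA. All names and numbers here are hypothetical.

def value_added_ranking(schools):
    """Rank schools by actual minus predicted performance, descending."""
    va = {name: actual - predicted for name, (predicted, actual) in schools.items()}
    return sorted(va, key=va.get, reverse=True)

schools = {
    "A": (60.0, 68.0),   # predicted 60, achieved 68 -> VA +8
    "B": (70.0, 71.0),   # VA +1
    "C": (55.0, 50.0),   # VA -5
}
print(value_added_ranking(schools))  # ['A', 'B', 'C']
```

The paper's point is that noise in the predicted column propagates directly into the ranking, which is why a lower-error predictor reorders schools.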
  2. By: Ki Young Park (School of Economics, Yonsei University); Youngjoon Lee (School of Business, Yonsei University); Soohyon Kim (Economic Research Institute, The Bank of Korea)
    Abstract: We quantify the Monetary Policy Board (MPB) minutes of the Bank of Korea (BOK) using text mining. We propose a novel approach using a field-specific Korean dictionary and contiguous sequences of words (n-grams) to better capture the subtlety of central bank communications. We find that our lexicon-based indicators help explain the current and future BOK monetary policy decisions when considering an augmented Taylor rule, suggesting that they contain additional information beyond the currently available macroeconomic variables. Our indicators remarkably outperform English-based textual classifications, a media-based measure of economic policy uncertainty, and a data-based measure of macroeconomic uncertainty. Our empirical results also emphasize the importance of using a field-specific dictionary and the original Korean text.
    Keywords: Monetary policy; Text mining; Central banking; Bank of Korea; Taylor rule
    JEL: E43 E52 E58
    Date: 2019–01–07
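A lexicon-based tone index over n-grams, of the kind the abstract describes, can be sketched as follows. The hawkish and dovish phrase lists below are invented for illustration; the paper's dictionary is field-specific and built from the original Korean text.

```python
# Hypothetical sketch of a lexicon-based tone index using bigrams (n-grams
# with n = 2). The phrase lists are made up, not the paper's dictionary.

HAWKISH = {"inflation pressure", "rate hike"}
DOVISH = {"downside risk", "rate cut"}

def ngrams(tokens, n):
    """Contiguous word sequences of length n."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tone(text):
    """(hawkish - dovish) / (hawkish + dovish) over bigram matches; 0 if none."""
    grams = ngrams(text.lower().split(), 2)
    hawk = sum(g in HAWKISH for g in grams)
    dove = sum(g in DOVISH for g in grams)
    return 0.0 if hawk + dove == 0 else (hawk - dove) / (hawk + dove)

print(tone("the board noted inflation pressure and a possible rate hike"))  # 1.0
```

Matching on n-grams rather than single words is what lets such an index distinguish, say, "rate hike" from "rate cut".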
  3. By: André Binette; Dmitri Tchebotarev
    Abstract: This note analyzes the evolution of the narrative in the Bank of Canada’s Monetary Policy Report (MPR). It presents descriptive statistics on the core text, including length, most frequently used words and readability level—the three Ls. Although each Governor of the Bank of Canada focuses on the macroeconomic events of the day and the mandate of inflation targeting, we observe that the language used in the MPR varies somewhat from one Governor’s tenure to the next. Our analysis also suggests that the MPR has been, on average, slightly more complicated than the average Canadian would be expected to understand. However, recent efforts to simplify the text have been successful. Using word embeddings and applying a well-established distance metric, we examine how the content of the MPR has changed over time. Increased levels of lexical innovation appear to coincide with important macroeconomic events. If substantial changes in economic conditions have been reflected in the MPR, quantifying changes in the text can help assess the perceived level of uncertainty regarding the outlook in the MPR. Lastly, we assess the sentiment (tone) in the MPR. We use a novel deep learning algorithm to measure sentiment (positive or negative) at the sentence level and aggregate the results for each MPR. The exceptionally large impacts of key events, such as 9/11, the global financial crisis and others, are easily recognizable by their significant effect on sentiment. The resulting measure can help assess the implicit balance of risks in the MPR. These measures (lexical innovations and sentiment) could then potentially serve to adjust the probability distributions around the Bank’s outlook by making them more reflective of the current situation.
    Keywords: Central bank research; Monetary Policy
    JEL: E02 E52
    Date: 2019–02
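The distance-based measure of lexical innovation can be sketched with word embeddings and a cosine metric: average the word vectors of each report, then measure how far consecutive averages drift apart. The tiny two-dimensional "embeddings" below are made up; a real application would use pretrained vectors over a full vocabulary.

```python
import math

# Illustrative sketch (not the Bank's code): lexical innovation as the
# cosine distance between average word vectors of consecutive documents.
# The 2-d embedding table is invented for demonstration.

EMB = {"growth": (1.0, 0.2), "inflation": (0.8, 0.6), "crisis": (-0.5, 1.0)}

def doc_vector(words):
    """Average the embeddings of the words found in the table."""
    vecs = [EMB[w] for w in words if w in EMB]
    return tuple(sum(c) / len(vecs) for c in zip(*vecs))

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

# A report about growth/inflation vs. one about a crisis: large distance.
d = cosine_distance(doc_vector(["growth", "inflation"]), doc_vector(["crisis"]))
print(round(d, 3))
```

Identical documents give distance 0; a spike in this measure between consecutive reports is the "lexical innovation" signal.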
  4. By: Dorner, Matthias (Institut für Arbeitsmarkt- und Berufsforschung (IAB), Nürnberg [Institute for Employment Research, Nuremberg, Germany]); Harhoff, Dietmar; Gaessler, Fabian; Hoisl, Karin; Poege, Felix
    Abstract: This data report describes the Linked Inventor Biography Data 1980–2014 (INV-BIO ADIAB 8014), its generation using record linkage and machine learning methods, and how to access the data via the FDZ. (Author's abstract, IAB-Doku)
    Keywords: Linked Inventor Biography Data, dataset description, patents, data collection, data preparation, data quality
    Date: 2019–02–04
  5. By: Victor Chernozhukov (Institute for Fiscal Studies and MIT); Vira Semenova (Institute for Fiscal Studies)
    Abstract: This paper provides estimation and inference methods for a structural function, such as the Conditional Average Treatment Effect (CATE), based on modern machine learning (ML) tools. We assume that such a function can be represented as the conditional expectation g(x) = E[Y(η) | X = x] of a signal Y(η), where η is an unknown nuisance function. In addition to the CATE, examples of such functions include the regression function with a partially missing outcome and the conditional average partial derivative. We approximate g(x) by a linear form p(x)'β, where p(x) is a vector of approximating functions and β is the Best Linear Predictor. Plugging the first-stage estimate of η into the signal, we estimate β via ordinary least squares of Y on p(X). We deliver a high-quality estimate of the pseudo-target function p(x)'β that features (a) a pointwise Gaussian approximation at a point x, (b) a simultaneous Gaussian approximation uniformly over x, and (c) an optimal rate of convergence uniformly over x. When the misspecification error of the linear form decays sufficiently fast, these approximations automatically hold for the target function g(x) instead of the pseudo-target. The first-stage nuisance parameter is allowed to be high-dimensional and is estimated by modern ML tools, such as neural networks, ℓ1-shrinkage estimators, and random forests. Using our method, we estimate the average price elasticity conditional on income using Yatchew and No (2001) data and provide uniform confidence bands for the target regression function.
    Date: 2018–07–04
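The second-stage regression at the heart of the method can be illustrated numerically. In this toy the first-stage nuisance is assumed to be already plugged into the signal, and the basis is simply (1, x); the data are invented.

```python
# Minimal numerical sketch: with the constructed signal y in hand, the Best
# Linear Predictor coefficients solve an ordinary least squares problem of
# y on the basis functions. Here the basis is (1, x) and data are invented.

def ols_2d(xs, ys):
    """OLS of y on (1, x): returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

# A signal exactly linear in x, so the BLP recovers it with no approximation
# error (the case where the linear form is correctly specified).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # y = 1 + 2x
print(ols_2d(xs, ys))
```

With a misspecified linear form, the same regression instead returns the best linear approximation to the target, which is the "pseudo-target" the abstract refers to.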
  6. By: Lineth Rodríguez (LS2N - Laboratoire des Sciences du Numérique de Nantes - UN - Université de Nantes - ECN - École Centrale de Nantes - CNRS - Centre National de la Recherche Scientifique - IMT Atlantique - IMT Atlantique Bretagne-Pays de la Loire); Mihalis Giannakis (Audencia Business School - Audencia Business School); Catherine Da Cunha (LS2N - Laboratoire des Sciences du Numérique de Nantes - UN - Université de Nantes - ECN - École Centrale de Nantes - CNRS - Centre National de la Recherche Scientifique - IMT Atlantique - IMT Atlantique Bretagne-Pays de la Loire, ECN - École Centrale de Nantes)
    Abstract: Scholars and practitioners have already shown that Big Data and Predictive Analytics (BDPA) can play a pivotal role in transforming and improving the functions of sustainable supply chain analytics (SSCA). However, there is limited knowledge about how BDPA can best be leveraged to grow social, environmental and financial performance simultaneously. The literature on SSCA suggests that companies still struggle to implement SSCA practices. Researchers agree that there is still a need to understand the techniques, tools, and enablers underpinning the adoption of SSCA; this is even more important for integrating BDPA as a strategic asset across business activities. Hence, this study will investigate what the enablers of SSCA are, and which BDPA tools and techniques enable the triple bottom line (3BL) of sustainability performance through SCA. For this purpose, we will collect responses to structured remote questionnaires targeting highly experienced supply chain professionals. We will then analyze the data using well-known statistical methods such as exploratory factor analysis (EFA), confirmatory factor analysis (CFA), and logistic regression.
    Keywords: sustainability, supply chain analytics, big data and predictive analytics, enablers
    Date: 2018–03–25
  7. By: Monica Andini (Bank of Italy); Michela Boldrini (University of Bologna); Emanuele Ciani (Bank of Italy); Guido de Blasio (Bank of Italy); Alessio D'Ignazio (Bank of Italy); Andrea Paladini (University of Rome "La Sapienza")
    Abstract: We use Machine Learning (ML) predictive tools to propose a policy-assignment rule designed to increase the effectiveness of public guarantee programs. This rule can be used as a benchmark to improve targeting in order to reach the stated policy goals. Public guarantee schemes should target firms that are both financially constrained and creditworthy, but they often employ naïve assignment rules (mostly based only on the probability of default) that may lead to an inefficient allocation of resources. Examining the case of Italy’s Guarantee Fund, we suggest a benchmark ML-based assignment rule, trained and tested on credit register data. Compared with the current eligibility criteria, the ML-based benchmark leads to a significant improvement in the effectiveness of the Fund in gaining credit access to firms. We discuss the problems in estimating and using these algorithms for the actual implementation of public policies, such as transparency and omitted payoffs.
    Keywords: machine learning, program evaluation, loan guarantees
    JEL: C5 H81
    Date: 2019–02
  8. By: Klaus Gründler; Tommy Krieger
    Abstract: We compile data for 186 countries (1919 - 2016) and apply different aggregation methods to create new democracy indices. We observe that most of the available aggregation techniques produce indices that are often too favorable for autocratic regimes and too unfavorable for democratic regimes. The sole exception is a machine learning technique. Using a stylized model, we show that applying an index with implausibly low (high) scores for democracies (autocracies) in a regression analysis produces upward-biased OLS and 2SLS estimates. The results of an analysis of the effect of democracy on economic growth show that the distortions in the OLS and 2SLS estimates are substantial. Our findings imply that commonly used indices are not well suited for empirical purposes.
    Keywords: data aggregation, democracy, economic growth, indices, institutions, machine learning, measurement of democracy, non-random measurement error
    JEL: C26 C43 O10 P16 P48
    Date: 2019
  9. By: Jakub Growiec
    Abstract: The article proposes a new conceptual framework for capturing production, R&D, and economic growth in aggregative models which extend their horizon into the digital era. Two key factors of production are considered: hardware, including physical labor, traditional physical capital and programmable hardware, and software, encompassing human cognitive work, pre-programmed software, and artificial intelligence (AI). Hardware and software are complementary in production, whereas their constituent components are mutually substitutable. The framework generalizes, among others, the standard model of production with capital and labor, models with capital–skill complementarity and skill-biased technical change, and unified growth theories embracing also the pre-industrial period. It offers a clear conceptual distinction between mechanization and automation, as well as between robotization and the development of AI. It delivers sharp, economically intuitive predictions for long-run growth, the evolution of factor shares, and the direction of technical change.
    Keywords: production function, R&D equation, technological progress, complementarity, automation, artificial intelligence.
    JEL: O30 O40 O41
    Date: 2019–02
  10. By: Colonnelli, E; Gallego, J.A.; Prem, M
    Abstract: Using rich micro data from Brazil, we show that multiple popular machine learning models display extremely high levels of performance in predicting municipality-level corruption in public spending. Measures of private sector activity, financial development, and human capital are the strongest predictors of corruption, while public sector and political features play a secondary role. Our findings have implications for the design and cost-effectiveness of various anti-corruption policies.
    Date: 2019–02–08
  11. By: Weber, Regine; Lukas, Kornher
    Abstract: With increasing Internet user rates across Africa, there is considerable interest in exploring new, online data sources. In particular, search engine metadata, i.e. data representing contemporaneous online interest in a specific topic, has attracted considerable attention because of its potential to extract a near real-time signal about the current interests of a society. The objective of this study is to analyze whether search engine metadata in the form of Google Search Query (GSQ) data can be used to improve now-casts of maize prices in nine African countries: Ethiopia, Kenya, Malawi, Mozambique, Rwanda, Tanzania, Uganda, Zambia and Zimbabwe. As a benchmark, we formulate an auto-regressive model for each country, which we subsequently augment with two specifications based on contemporaneous GSQ data. We test the models in-sample and in a pseudo out-of-sample, one-step-ahead now-casting environment and compare their forecasting errors. The GSQ specifications improve the now-casting fit in 8 out of 9 countries and reduce the now-casting error by between 3% and 23%. The largest improvements are achieved for Malawi, Kenya, Zambia and Tanzania, each larger than 14%.
    Keywords: Agricultural and Food Policy, Research and Development/Tech Change/Emerging Technologies
    Date: 2019–02–12
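The two model classes being compared can be sketched as simple one-step-ahead now-cast equations: a baseline autoregression in lagged prices, and the same equation augmented with a contemporaneous Google search index. All coefficients below are hypothetical placeholders, not estimates from the paper.

```python
# Sketch of the benchmark AR(1) now-cast versus a GSQ-augmented version.
# phi, gamma and the constants are invented illustration values.

def nowcast_ar(price_lag, phi=0.9, const=5.0):
    """Baseline: p_t = const + phi * p_{t-1}."""
    return const + phi * price_lag

def nowcast_ar_gsq(price_lag, gsq_index, phi=0.85, gamma=0.05, const=4.0):
    """Augmented: p_t = const + phi * p_{t-1} + gamma * GSQ_t."""
    return const + phi * price_lag + gamma * gsq_index

p_prev, gsq = 100.0, 60.0
print(nowcast_ar(p_prev))
print(nowcast_ar_gsq(p_prev, gsq))
```

The paper's comparison amounts to fitting both equations per country and checking whether the GSQ term lowers the pseudo out-of-sample error.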
  12. By: Chariton Chalvatzis; Dimitrios Hristu-Varsakelis
    Abstract: We present a deep long short-term memory (LSTM)-based neural network for predicting asset prices, together with a successful trading strategy for generating profits based on the model's predictions. Our work is motivated by the fact that the effectiveness of any prediction model is inherently coupled to the trading strategy it is used with, and vice versa. This highlights the difficulty in developing models and strategies which are jointly optimal, but also points to avenues of investigation which are broader than prevailing approaches. Our LSTM model is structurally simple and generates predictions based on price observations over a modest number of past trading days. The model's architecture is tuned to promote profitability, as opposed to accuracy, under a strategy that does not trade simply on whether the price is predicted to rise or fall, but rather takes advantage of the distribution of predicted returns, and the fact that a prediction's position within that distribution carries useful information about the expected profitability of a trade. The proposed model and trading strategy were tested on the S&P 500, Dow Jones Industrial Average (DJIA), NASDAQ and Russell 2000 stock indices, and achieved cumulative returns of 329%, 241%, 468% and 279%, respectively, over 2010–2018, far outperforming the benchmark buy-and-hold strategy as well as other recent efforts.
    Date: 2019–02
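The distribution-aware trading rule the abstract highlights can be sketched without the network itself: trade on where a predicted return falls within the distribution of recent predictions, rather than on its sign alone. The quantile thresholds and return values below are invented.

```python
# Illustrative sketch (not the authors' model): position sizing based on the
# percentile rank of a predicted return within past predictions. Thresholds
# long_q and short_q are hypothetical.

def percentile_rank(history, value):
    """Fraction of past predictions at or below the current one."""
    return sum(h <= value for h in history) / len(history)

def position(history, predicted, long_q=0.8, short_q=0.2):
    rank = percentile_rank(history, predicted)
    if rank >= long_q:
        return "long"
    if rank <= short_q:
        return "short"
    return "flat"

past = [-0.01, 0.002, 0.004, 0.006, 0.011]
print(position(past, 0.010))   # 'long' (top of the distribution)
print(position(past, 0.003))   # 'flat' (middle: no trade)
print(position(past, -0.02))   # 'short'
```

Note that a small positive prediction sitting in the middle of the distribution triggers no trade, which is exactly how this rule differs from sign-based trading.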
  13. By: Alicia García-Herrero; Jianwei Xu
    Abstract: Drawing on a global database of media articles, we quantitatively assess perceptions of China’s Belt and Road Initiative (BRI) in different countries and regions. We find that the BRI is generally positively received. All regions as a whole, except South Asia, have a positive perception of the BRI, but there are marked differences at the country level, with some countries in all regions having very negative views. Interestingly, there is no significant difference in perceptions of the BRI between countries that officially participate in the BRI and those that do not. We also use our dataset of media articles to identify the topics that are most frequently associated with the BRI. The most common topics are trade and investment. Finally, we use regression analysis to identify how the frequency with which these topics are discussed in the news affects the perceptions of the BRI in different countries. We find that the more frequently trade is mentioned in the media, the more negative a country’s perception of the BRI tends to be. On the other hand, while investment under the BRI seems also to attract attention in the media, it is not statistically relevant for countries’ perceptions of the BRI.
    Date: 2019–02
  14. By: Maryam Farboodi; Roxana Mihet; Thomas Philippon; Laura Veldkamp
    Abstract: We study a model where firms accumulate data as a valuable intangible asset. Data accumulation affects firms' dynamics. It increases the skewness of the firm size distribution, as large firms generate more data and invest more in active experimentation. On the other hand, small data-savvy firms can overtake more traditional incumbents, provided they can finance their initial money-losing growth. Our model can be used to estimate the market and social value of data.
    JEL: D21 E01 L1
    Date: 2019–01
  15. By: Santiago Fernández de Lis; Pablo Urbiola
    Abstract: Digital transformation has opened the financial services market to new kinds of providers with great disruptive potential, including big technology companies. This article explores how the scope of their entry is conditioned by regulation, data access rules and competition policy.
    Keywords: Working Paper, Financial regulation, Digital economy, Digital Regulation, Digital Trends, Global
    Date: 2019–01
  16. By: Simon Fecamp; Joseph Mikael; Xavier Warin
    Abstract: We propose machine-learning-based algorithms to solve hedging problems in incomplete markets. Sources of incompleteness include illiquidity, untradable risk factors, discrete hedging dates and transaction costs. The strategies produced by the proposed algorithms are compared to classical stochastic control techniques on several payoffs using a variance criterion. One of the proposed algorithms is flexible enough to be used with several existing risk criteria. We furthermore propose a new moment-based risk criterion.
    Date: 2019–02
  17. By: Paolo Andreini (University of Rome "Tor Vergata"); Donato Ceci (University of Rome "Tor Vergata" & Bank of Italy)
    Abstract: In this paper, we study the predictive power of dense and sparse estimators in a high-dimensional space. We propose a new forecasting method, called Elastically Weighted Principal Components Analysis (EWPCA), that selects variables with respect to the target variable while taking into account the collinearity in the data using Elastic Net soft thresholding. We then weight the selected predictors by their Elastic Net regression coefficients, and finally apply principal component analysis to the new "elastically" weighted data matrix. We compare this method to common benchmarks and other methods for forecasting macroeconomic variables in a data-rich environment, divided into dense representations, such as dynamic factor models and ridge regression, and sparse representations, such as LASSO regression. All these models are adapted to take into account the linear dependency of the macroeconomic time series. Moreover, to estimate the hyperparameters of these models, including the EWPCA, we propose a new procedure called "brute force". This method allows us to treat all the hyperparameters of the models uniformly and to take the longitudinal structure of the time-series data into account. Our findings can be summarized as follows. First, the "brute force" method of estimating the hyperparameters is more stable and gives better forecasting performance, in terms of mean squared forecast error (MSFE), than the traditional criteria used in the literature to tune the hyperparameters. This result holds for all sample sizes and forecasting horizons. Second, our two-step forecasting procedure enhances the interpretability of the forecasts. Lastly, the EWPCA leads to better forecasting performance, in terms of MSFE, than the other sparse and dense methods or the naïve benchmark, at different forecast horizons and sample sizes.
    Keywords: Variable selection,High-dimensional time series,Dynamic factor models,Shrinkage methods,Cross-validation
    JEL: C22 C52 C53
    Date: 2019–02–14
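The "elastic weighting" step of EWPCA can be sketched as follows: drop the predictors the Elastic Net zeroes out, and rescale the survivors by their coefficients before running PCA (the PCA step itself is omitted here). Coefficients and data are invented.

```python
# Toy sketch of the EWPCA pre-processing step: keep only columns with a
# nonzero elastic-net coefficient and rescale each surviving column by its
# coefficient. The coefficient vector and data matrix are hypothetical.

def elastically_weight(X, coefs):
    """X: list of rows; coefs: one elastic-net coefficient per column."""
    keep = [j for j, c in enumerate(coefs) if c != 0.0]   # selection step
    return [[row[j] * coefs[j] for j in keep] for row in X]  # weighting step

X = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
coefs = [0.5, 0.0, 2.0]   # middle column dropped by the elastic net
print(elastically_weight(X, coefs))  # [[0.5, 6.0], [2.0, 12.0]]
```

PCA applied to this reweighted matrix then favors directions that load on predictors the Elastic Net judged relevant for the target, which is the paper's central idea.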

This nep-big issue is ©2019 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at <>. For comments, please write to the director of NEP, Marco Novarese, at <>. Put “NEP” in the subject; otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.