nep-big New Economics Papers
on Big Data
Issue of 2021‒01‒18
eighteen papers chosen by
Tom Coupé
University of Canterbury

  1. COVID19-HPSMP: COVID-19 Adopted Hybrid and Parallel Deep Information Fusion Framework for Stock Price Movement Prediction By Farnoush Ronaghi; Mohammad Salimibeni; Farnoosh Naderkhani; Arash Mohammadi
  2. Fintech Credit Risk Assessment for SMEs: Evidence from China By Yiping Huang; Longmei Zhang; Zhenhua Li; Han Qiu; Tao Sun; Xue Wang
  3. Day-ahead electricity prices prediction applying hybrid models of LSTM-based deep learning methods and feature selection algorithms under consideration of market coupling By Wei Li; Denis Mike Becker
  4. The Missing 15 Percent of Patent Citations By Verluise, Cyril; Cristelli, Gabriele; Higham, Kyle; de Rassenfosse, Gaetan
  5. Using Predictive Analytics for Public Policy: The Case for Lost Work due to the COVID-19 By Cheng, Kent Jason Go
  6. The Hard Problem of Prediction for Conflict Prevention By Mueller, H.; Rauh, C.
  7. Text analysis in financial disclosures By Sridhar Ravula
  8. Economic Policy Uncertainty index meets ensemble learning By Ivana Lolić; Petar Sorić; Marija Logarušić
  9. A Machine-Learning History of English Caselaw and Legal Ideas Prior to the Industrial Revolution I: Generating and Interpreting the Estimates By Peter Grajzl; Peter Murrell
  10. Forecasting the Olympic medal distribution during a pandemic: a socio-economic machine learning model By Christoph Schlembach; Sascha L. Schmidt; Dominik Schreyer; Linus Wunderlich
  11. Improving the Short-term Forecast of World Trade During the Covid-19 Pandemic Using Swift Data on Letters of Credit By Benjamin Carton; Nan Hu; Joannes Mongardini; Kei Moriya; Aneta Radzikowski
  12. A Machine-Learning History of English Caselaw and Legal Ideas Prior to the Industrial Revolution II: Applications By Peter Grajzl; Peter Murrell
  13. Robots, AI, and Related Technologies: A Mapping of the New Knowledge Base By Enrico Santarelli; Jacopo Staccioli; Marco Vivarelli
  14. Predicting COVID-19 Spread Level using Socio-Economic Indicators and Machine Learning Techniques By Alaeddine Mihoub; Hosni Snoun; Moez Krichen; Montassar Kahia; Riadh Bel Hadj Salah
  15. Demand Shocks, Procurement Policies, and the Nature of Medical Innovation: Evidence from Wartime Prosthetic Device Patents By Jeffrey P. Clemens; Parker Rogers
  16. Why are some U.S. cities successful, while others are not? Empirical evidence from machine learning By Damien Azzopardi; Fozan Fareed; Patrick Lenain; Douglas Sutherland
  17. The Improvement of Retargeting by Big Data: a Decision Support that Threatens the Brand Image? By Maria Mercanti-Guérin
  18. Whose intelligence is artificial intelligence? By Paola Tubaro

  1. By: Farnoush Ronaghi; Mohammad Salimibeni; Farnoosh Naderkhani; Arash Mohammadi
    Abstract: The novel coronavirus (COVID-19) has suddenly and abruptly changed the world as we knew it at the start of the third decade of the 21st century. In particular, the COVID-19 pandemic has negatively affected financial econometrics and stock markets across the globe. Artificial Intelligence (AI) and Machine Learning (ML)-based prediction models, especially Deep Neural Network (DNN) architectures, have the potential to act as a key enabling factor to reduce the adverse effects of the COVID-19 pandemic and possible future ones on financial markets. In this regard, first, a unique COVID-19 related PRIce MOvement prediction (COVID19 PRIMO) dataset is introduced in this paper, which incorporates effects of social media trends related to COVID-19 on stock market price movements. Afterwards, a novel hybrid and parallel DNN-based framework is proposed that integrates different and diversified learning architectures. Referred to as the COVID-19 adopted Hybrid and Parallel deep fusion framework for Stock price Movement Prediction (COVID19-HPSMP), the framework uses innovative fusion strategies to combine scattered social media news related to COVID-19 with historical market data. The proposed COVID19-HPSMP consists of two parallel paths (hence hybrid), one based on a Convolutional Neural Network (CNN) with Local/Global Attention modules, and one integrated CNN and Bi-directional Long Short-Term Memory (BLSTM) path. The two parallel paths are followed by a multilayer fusion layer acting as a fusion centre that combines localized features. Performance evaluations based on the introduced COVID19 PRIMO dataset illustrate the superior performance of the proposed framework.
    Date: 2021–01
  2. By: Yiping Huang; Longmei Zhang; Zhenhua Li; Han Qiu; Tao Sun; Xue Wang
    Abstract: Promoting credit services to small and medium-size enterprises (SMEs) has been a perennial challenge for policy makers globally due to high information costs. Recent fintech developments may be able to mitigate this problem. By leveraging big data or digital footprints on existing platforms, some big technology (BigTech) firms have extended short-term loans to millions of small firms. By analyzing 1.8 million loan transactions of a leading Chinese online bank, this paper compares the fintech approach to assessing credit risk using big data and machine learning models with the bank approach using traditional financial data and scorecard models. The study shows that the fintech approach yields better prediction of loan defaults during normal times and periods of large exogenous shocks, reflecting information and modeling advantages. BigTech’s proprietary information can complement or, where necessary, substitute credit history in risk assessment, allowing unbanked firms to borrow. Furthermore, the fintech approach benefits SMEs that are smaller and in smaller cities, hence complementing the role of banks by reaching underserved customers. With more effective and balanced policy support, BigTech lenders could help promote financial inclusion worldwide.
    Keywords: Fintech;Machine learning;Bank credit;Loans;Credit risk;WP,credit history,Fintech firm,house ownership,internet company,real-time customer rating
    Date: 2020–09–25
  3. By: Wei Li; Denis Mike Becker
    Abstract: The availability of accurate day-ahead electricity price forecasts is pivotal for electricity market participants. In the context of trade liberalisation and market harmonisation in the European markets, accurate price forecasting becomes even more difficult to obtain. The increasing power market integration has complicated the forecasting process, where electricity forecasting requires considering features from both the local market and ever-growing coupling markets. In this paper, we apply state-of-the-art deep learning models, combined with feature selection algorithms, for electricity price prediction under the consideration of market coupling. We propose three hybrid architectures of long short-term memory (LSTM) deep neural networks and compare their prediction performance across various feature selections. In our empirical study, we construct a broad set of features from the Nord Pool market and its six coupling countries for forecasting the Nord Pool system price. The results show that feature selection is essential to achieving accurate prediction. Superior feature selection algorithms filter meaningful information, eliminate irrelevant information, and further improve the forecasting accuracy of LSTM-based deep neural networks. The proposed models obtain considerably accurate results.
    Date: 2021–01
  4. By: Verluise, Cyril; Cristelli, Gabriele; Higham, Kyle; de Rassenfosse, Gaetan
    Abstract: Patent citations are one of the most commonly used metrics in the innovation literature. Leading uses of patent-to-patent citations are associated with the quantification of inventions' quality and the measurement of knowledge flows. Due to their widespread availability, scholars have exploited citations listed on the front page of patent documents. Citations appearing in the full text of patent documents have been neglected. We apply modern machine learning methods to extract these citations from the text of USPTO patent documents. Overall, we are able to recover an additional 15 percent of patent citations that could not be found using only front-page data. We show that "in-text" citations bring a different type of information compared to front-page citations. They exhibit higher text similarity to the citing patents and alter the ranking of patent importance. The dataset is available at (CC-BY-4).
    Date: 2020–12–23
  5. By: Cheng, Kent Jason Go (Syracuse University)
    Abstract: In this brief research article, I demonstrate how predictive analytics or machine learning can be used to predict outcomes that are of interest in public policy. I developed a predictive model that determined who was not able to work during the past four weeks because the COVID-19 pandemic led their employer to close or lose business. I used the Current Population Survey (CPS) collected from May to November 2020 (N=352,278). Predictive models considered were logistic regression and ensemble-based methods (bagging of regression trees, random forests, and boosted regression trees). Predictors included (1) individual-, (2) family-, and (3) community- or societal-level factors. To validate the models, I used random training/test splits with equal allocation of samples to the training and testing data. The random forest with the full set of predictors and number of splits set to the square root of the number of predictors yielded the lowest testing error rate. Predictive analytics that seek to forecast the inability to work due to the pandemic can be used for automated means-testing to determine who gets aid like unemployment benefits or food stamps.
    Date: 2021–01–03
  6. By: Mueller, H.; Rauh, C.
    Abstract: There is a growing interest in prevention in several policy areas, and this provides a strong motivation for an improved integration of forecasting with machine learning into models of decision making. In this article we propose a framework to tackle conflict prevention. A key problem of conflict forecasting for prevention is that predicting the start of conflict in previously peaceful countries needs to overcome a low baseline risk. To make progress on this hard problem, this project combines a newspaper-text corpus of more than 4 million articles with unsupervised and supervised machine learning. The output of the forecast model is then integrated into a simple static framework in which a decision maker decides on the optimal number of interventions to minimize the total cost of conflict and intervention. This exercise highlights the potential cost savings of prevention, for which reliable forecasts are a prerequisite.
    Date: 2021–01–06
  7. By: Sridhar Ravula
    Abstract: Financial disclosure analysis and knowledge extraction is an important financial analysis problem. Prevailing methods depend predominantly on quantitative ratios and techniques, which suffer from limitations like window dressing and a focus on the past. Most of the information in a firm's financial disclosures is in unstructured text and contains valuable information about its health. Humans and machines fail to analyze it satisfactorily due to the enormous volume and unstructured nature, respectively. Researchers have recently started analyzing text content in disclosures. This paper covers the previous work in unstructured data analysis in finance and accounting. It also explores the state-of-the-art methods in computational linguistics and reviews the current methodologies in Natural Language Processing (NLP). Specifically, it focuses on research related to text source, linguistic attributes, firm attributes, and mathematical models employed in the text analysis approach. This work contributes to disclosure analysis methods by highlighting the limitations of the current focus on sentiment metrics and highlighting broader future research areas.
    Date: 2021–01
  8. By: Ivana Lolić (Faculty of Economics & Business, University of Zagreb); Petar Sorić (Faculty of Economics & Business, University of Zagreb); Marija Logarušić (Faculty of Economics & Business, University of Zagreb)
    Abstract: We utilize two specific ensemble learning methods (ensemble linear regression model (LM) and random forest (RF)) in the data-rich environment of the Newsbank media database to scrutinize the possibilities of enhancing the predictive accuracy of the Economic Policy Uncertainty (EPU) index. The LM procedure mostly outperforms both RF-based assessments and the original EPU index. We find that our LM estimate behaves more like an uncertainty indicator than the RF-based uncertainty or the original EPU index. It is strongly correlated with other standard uncertainty proxies, it is more countercyclical, and it has more pronounced leading properties. Finally, we considerably widen the scope of search terms included in the calculation of the EPU index. We find that the predictive precision of the EPU index can be considerably increased using a more diversified set of uncertainty-related terms than the original EPU framework.
    Keywords: Economic Policy Uncertainty Index; textual analysis; ensemble learning; random forest model
    JEL: C55 E03 E32
  9. By: Peter Grajzl; Peter Murrell
    Abstract: The history of England’s institutions has long informed research on comparative economic development. Yet to date there exists no quantitative evidence on a core aspect of England’s institutional evolution, that embodied in the accumulated decisions of English courts. Focusing on the two centuries before the Industrial Revolution, we generate and analyze the first quantitative estimates of the development of English caselaw and its associated legal ideas. We achieve this in two companion papers. In this, the first of the pair, we build a comprehensive corpus of 52,949 reports of cases heard in England's high courts before 1765. Estimating a 100-topic structural topic model, we name and interpret all topics, each of which reflects a distinctive aspect of English legal thought. We produce time series of the estimated topic prevalences. To interpret the topic timelines, we develop a tractable model of the evolution of legal-cultural ideas and their prominence in case reports. In the companion paper, we will illustrate with multiple applications the usefulness of the large amount of new information generated by our approach.
    Keywords: English history, institutional development, machine learning, caselaw, idea diffusion
    JEL: C80 N00 K10 Z10 P10
    Date: 2020
  10. By: Christoph Schlembach; Sascha L. Schmidt; Dominik Schreyer; Linus Wunderlich
    Abstract: Forecasting the number of Olympic medals for each nation is highly relevant for different stakeholders: Ex ante, sports betting companies can determine the odds while sponsors and media companies can allocate their resources to promising teams. Ex post, sports politicians and managers can benchmark the performance of their teams and evaluate the drivers of success. To significantly increase the Olympic medal forecasting accuracy, we apply machine learning, more specifically a two-staged Random Forest, thus outperforming, for the first time, more traditional naïve forecasts for the three previous Olympics held between 2008 and 2016. Regarding the Tokyo 2020 Games in 2021, our model suggests that the United States will lead the Olympic medal table, winning 120 medals, followed by China (87) and Great Britain (74). Intriguingly, we predict that the current COVID-19 pandemic will not significantly alter the medal count, as all countries suffer from the pandemic to some extent (data inherent) and historical data points on comparable diseases are limited (model inherent).
    Date: 2020–12
  11. By: Benjamin Carton; Nan Hu; Joannes Mongardini; Kei Moriya; Aneta Radzikowski
    Abstract: An essential element of the work of the Fund is to monitor and forecast international trade. This paper uses SWIFT messages on letters of credit, together with crude oil prices and new export orders of manufacturing Purchasing Managers’ Index (PMI), to improve the short-term forecast of international trade. A horse race between linear regressions and machine-learning algorithms for the world and 40 large economies shows that forecasts based on linear regressions often outperform those based on machine-learning algorithms, confirming the linear relationship between trade and its financing through letters of credit.
    Keywords: Oil prices;Imports;Exports;Trade finance;Trade balance;SWIFT,trade forecast,machine learning,WP,world trade,trade message,Brent crude oil price,trade advance,letter of credit,linear regression forecast,Merchandise trade,World trade sample
    Date: 2020–11–13
  12. By: Peter Grajzl; Peter Murrell
    Abstract: This is the second of two papers that generate and analyze quantitative estimates of the development of English caselaw and associated legal ideas before the Industrial Revolution. In the first paper, we estimated a 100-topic structural topic model, named the topics, and showed how to interpret topic-prevalence timelines. Here, we provide examples of new insights that can be gained from these estimates. We first provide a bird's-eye view, aggregating the topics into fifteen themes. Procedure is the highest-prevalence theme, but by the mid-18th century attention to procedure decreases sharply, indicating solidification of court institutions. Important ideas on real-property were substantially settled by the mid-17th century and on contracts and torts by the mid-18th century. Thus, crucial elements of caselaw developed before the Industrial Revolution. We then examine the legal ideas associated with England's financial revolution. Many new legal ideas relevant to finance were well accepted before the Glorious Revolution. Finally, we examine the sources of law used in the courts. Emphasis on precedent-based reasoning increases by 1650, but diffusion was gradual, with pertinent ideas solidifying only after 1700. Ideas on statute applicability were accepted by the mid-16th century but debates on the legislature’s intent still occurred in 1750.
    Keywords: English history, institutional development, caselaw, financial revolution, sources of law
    JEL: C80 N00 K10 O43
    Date: 2020
  13. By: Enrico Santarelli; Jacopo Staccioli; Marco Vivarelli
    Abstract: Using the entire population of USPTO patent applications published between 2002 and 2019, and leveraging both patent classification and semantic analysis, this paper aims to map the current knowledge base centred on robotics and AI technologies. These technologies will be investigated both as a whole and distinguishing core and related innovations, along a 4-level core-periphery architecture. Merging patent applications with the Orbis IP firm-level database will allow us to put forward a threefold analysis based on industry of activity, geographic location, and firm productivity. In a nutshell, results show that: (i) rather than representing a technological revolution, the new knowledge base is strictly linked to the previous technological paradigm; (ii) the new knowledge base is characterised by a considerable - but not impressively widespread - degree of pervasiveness; (iii) robotics and AI are strictly related, converging (particularly among the related technologies) and jointly shaping a new knowledge base that should be considered as a whole, rather than consisting of two separate GPTs; (iv) the U.S. technological leadership turns out to be confirmed.
    Keywords: Robotics; Artificial Intelligence; General Purpose Technology; Technological Paradigm; Industry 4.0; Patents full-text.
    Date: 2021–01–11
  14. By: Alaeddine Mihoub; Hosni Snoun; Moez Krichen (REDCAD - Unité de Recherche en développement et contrôle d'applications distribuées - ENIS - École Nationale d'Ingénieurs de Sfax | National School of Engineers of Sfax); Montassar Kahia; Riadh Bel Hadj Salah
    Abstract: The new so-called COVID-19 virus has unfortunately been found to be highly transmissible across the globe. In this study, we propose a novel approach for estimating the spread level of the virus for each country for three different dates between April and May 2020. Unlike previous studies, this investigation does not process any historical data of spread but rather relies on the socioeconomic indicators of each country. In fact, more than 1000 socioeconomic indicators and more than 190 countries were processed in this study. Concretely, data preprocessing techniques and feature selection approaches were applied to extract relevant indicators for the classification process. Countries around the globe were assigned to 4 classes of spread. To find the class level of each country, many classifiers were proposed, based especially on Support Vector Machines (SVM), Multi-Layer Perceptrons (MLP) and Random Forests (RF). The results obtained show the relevance of our approach, since many classifiers succeeded in capturing the spread level, especially the RF classifier, with an F-measure equal to 93.85% for April 15th, 2020. Moreover, a feature importance study is conducted to deduce the best indicators to build robust spread level classifiers. However, as pointed out in the discussion, classifiers may face some difficulties for future dates because of the huge increase in cases and the lack of other relevant factors affecting the spread.
    Keywords: covid-19,socio-economic indicators,data preprocessing,spread level prediction,machine learning,country classification,coronavirus,SARS-CoV-2,feature importance
    Date: 2020–11–03
  15. By: Jeffrey P. Clemens; Parker Rogers
    Abstract: We analyze wartime prosthetic device patents to investigate how demand and procurement policy can shape medical innovation. We use machine learning tools to develop new data describing the aspects of medical and mechanical innovations that are emphasized in patent documents. Our analysis of historical patents yields three primary facts. First, we find that the U.S. Civil War and World War I led to substantial increases in the quantity of prosthetic device patenting relative to patenting in other medical and mechanical technology classes. Second, we find that the Civil War led inventors to focus broadly on improving aspects of the production process, while World War I did not, consistent with the United States applying a more cost-conscious procurement model during the Civil War. Third, we find that inventors emphasized dimensions of product quality (e.g., a prosthetic’s appearance or comfort) that aligned with differences in buyers’ preferences, as described in the historical record, across wars. We conclude that procurement environments can significantly shape the scientific problems with which inventors engage, including the choice to innovate on quality or cost.
    Keywords: procurement, medical innovation, health care, health economics
    JEL: H57 I10 O31
    Date: 2020
  16. By: Damien Azzopardi; Fozan Fareed; Patrick Lenain; Douglas Sutherland
    Abstract: The U.S. population has become increasingly concentrated in large metropolitan areas. However, there are striking differences between the performances of big cities: some of them have been very successful and have been able to pull away from the rest, while others have stagnated or even declined. The main objective of this paper is to characterize U.S. metropolitan areas according to their labor-market performance: which metropolitan areas are struggling and falling behind? Which ones are flourishing? Which ones are staying resilient by adapting to shocks? We rely on an unsupervised machine learning technique called Hierarchical Agglomerative Clustering (HAC) to conduct this empirical investigation. The data comes from a number of sources including the new Job-to-Job (J2J) flows dataset from the Census Bureau, which reports the near universe of job movements in and out of employment at the metropolitan level. We characterize the fate of metropolitan areas by tracking their job mobility rate, unemployment rate, income growth, population increase, net change in job-to-job mobility and GDP growth. Our results indicate that the 372 metropolitan areas under examination can be categorized into four statistically distinct groups: booming areas (67), prosperous mega metropolitan areas (99), resilient areas (149) and distressed metropolitan areas (57). The results show that areas that are doing well are predominantly located in the south and the west. The main features of their success have revolved around embracing digital technologies, adopting local regulations friendly to job mobility and business creation, avoiding strict rules on land-use and housing market, and improving the wellbeing of the city’s population. These results highlight that cities adopting well-targeted policies can accelerate the return to growth after a shock.
    Keywords: clustering analysis, job-to-job flows, Labour mobility, metropolitan areas, United States
    JEL: E24 J11 J61 C38 O51
    Date: 2020–12–18
  17. By: Maria Mercanti-Guérin (IAE Paris - Sorbonne Business School)
    Abstract: With the emergence of Big Data and the increasing market penetration of ad retargeting, the advertising industry's interest in using this new online marketing method is rising. Retargeting is an innovative technology based on Big Data. People who have gone to a merchant site and window-shopped but not purchased can be re-pitched with the product they showed an interest in. Click rates and conversion rates are therefore dramatically enhanced by retargeting. However, in spite of the increasing number of companies investing in retargeting, there is little academic research on this topic. In this paper we explore the links between retargeting, perceived intrusiveness and brand image. As results show the importance of perceived intrusiveness, ad repetition and ad relevance, we introduce new analytical perspectives on online strategies with the goal of facilitating collaboration between consumers and marketers.
    Keywords: Big Data,Retargeting,Perceived Intrusiveness,Ad Relevance
    Date: 2020–10–12
  18. By: Paola Tubaro (TAU - TAckling the Underspecified - Inria Saclay - Ile de France - Inria - Institut National de Recherche en Informatique et en Automatique - LRI - Laboratoire de Recherche en Informatique - CentraleSupélec - Université Paris-Saclay - CNRS - Centre National de la Recherche Scientifique, LRI - Laboratoire de Recherche en Informatique - CentraleSupélec - Université Paris-Saclay - CNRS - Centre National de la Recherche Scientifique, IDHES - Institutions et Dynamiques Historiques de l'Économie et de la Société - ENS Cachan - École normale supérieure - Cachan - UP1 - Université Panthéon-Sorbonne - UP8 - Université Paris 8 Vincennes-Saint-Denis - UPN - Université Paris Nanterre - CNRS - Centre National de la Recherche Scientifique - UEVE - Université d'Évry-Val-d'Essonne, CNRS - Centre National de la Recherche Scientifique)
    Date: 2020–04

This nep-big issue is ©2021 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at For comments please write to the director of NEP, Marco Novarese at <>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.