nep-big New Economics Papers
on Big Data
Issue of 2021‒01‒11
seventeen papers chosen by
Tom Coupé
University of Canterbury

  1. The Value Added of Machine Learning to Causal Inference: Evidence from Revisited Studies By Anna Baiardi; Andrea A. Naghi
  2. Forecasting Mid-price Movement of Bitcoin Futures Using Machine Learning By Akyildirim, Erdinc; Cepni, Oguzhan; Corbet, Shaen; Uddin, Gazi Salah
  3. The Cross-Sectional Pricing of Corporate Bonds Using Big Data and Machine Learning By Turan G. Bali; Amit Goyal; Dashan Huang; Fuwei Jiang; Quan Wen
  4. Machine Learning Advances for Time Series Forecasting By Ricardo P. Masini; Marcelo C. Medeiros; Eduardo F. Mendes
  5. A Comparison of Statistical and Machine Learning Algorithms for Predicting Rents in the San Francisco Bay Area By Paul Waddell; Arezoo Besharati-Zadeh
  6. Trader-Company Method: A Metaheuristic for Interpretable Stock Price Prediction By Katsuya Ito; Kentaro Minami; Kentaro Imajo; Kei Nakagawa
  7. The Missing 15 Percent of Patent Citations By Cyril Verluise; Gabriele Cristelli; Kyle Higham; Gaetan de Rassenfosse
  8. A machine learning solver for high-dimensional integrals: Solving Kolmogorov PDEs by stochastic weighted minimization and stochastic gradient descent through a high-order weak approximation scheme of SDEs with Malliavin weights By Riu Naito; Toshihiro Yamada
  9. Recent Trends in Real Estate Research: A Comparison of Recent Working Papers and Publications using Machine Learning Algorithms By Breuer, Wolfgang; Steininger, Bertram
  10. Mobile Applications Aiming to Facilitate Immigrants' Societal Integration and Overall Level of Integration, Health and Mental Health: Does Artificial Intelligence Enhance Outcomes? By Drydakis, Nick
  11. Does Big Data Improve Financial Forecasting? The Horizon Effect By Olivier Dessaint; Thierry Foucault; Laurent Frésard
  12. The Unintended Consequences of Stay-at-Home Policies on Work Outcomes: The Impacts of Lockdown Orders on Content Creation By Xunyi Wang; Reza Mousavi; Yili Hong
  13. Nowcasting Networks By Marc Chataigner; Stephane Crepey; Jiang Pu
  14. Exploration of model performances in the presence of heterogeneous preferences and random effects utilities awareness By Nikita Gusarov; Amirreza Talebijamalabad; Iragaël Joly
  15. Automation and Robots in Services: Review of Data and Taxonomy By Matteo Sostero
  16. First Time Around: Local Conditions and Multi-dimensional Integration of Refugees By Cevat Giray Aksoy; Panu Poutvaara; Felicitas Schikora
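  17. Künstliche Intelligenz im Telekommunikationssektor – Bedeutung, Entwicklungsperspektiven und regulatorische Implikationen [Artificial Intelligence in the Telecommunications Sector: Significance, Development Prospects and Regulatory Implications] By Lundborg, Martin; Märkel, Christian; Schrade-Grytsenko, Lisa; Stamm, Peter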

  1. By: Anna Baiardi (Erasmus University Rotterdam); Andrea A. Naghi (Erasmus University Rotterdam)
    Abstract: A new and rapidly growing econometric literature is making advances in the problem of using machine learning (ML) methods for causal inference questions. Yet, the empirical economics literature has not started to fully exploit the strengths of these modern methods. We revisit influential empirical studies with causal machine learning methods and identify several advantages of using these techniques. We show that these advantages and their implications are empirically relevant and that the use of these methods can improve the credibility of causal analysis.
    Keywords: Machine learning, causal inference, average treatment effects, heterogeneous treatment effects
    JEL: D04 C01 C21
    Date: 2021–01–04
  2. By: Akyildirim, Erdinc (Department of Mathematics, ETH, Zurich, Switzerland and University of Zurich, Department of Banking and Finance, Zurich, Switzerland and Department of Banking and Finance, Burdur Mehmet Akif Ersoy University, Burdur, Turkey); Cepni, Oguzhan (Department of Economics, Copenhagen Business School); Corbet, Shaen (DCU Business School, Dublin City University, Dublin 9, Ireland and School of Accounting, Finance and Economics, University of Waikato, New Zealand); Uddin, Gazi Salah (Department of Management and Engineering, Linköping University, 581 83 Linköping, Sweden)
    Abstract: In the aftermath of the global financial crisis and amid the ongoing COVID-19 pandemic, investors face challenges in understanding price dynamics across assets. In this paper, we explore the applicability of a large scale comparison of machine learning algorithms (MLA) to predict mid-price movement for bitcoin futures prices. We use high-frequency intra-day data to evaluate the relative forecasting performances across various time-frequencies, ranging between 5 minutes and 60 minutes. The empirical analysis is based on six different specifications of MLA methods during the pandemic. The empirical results show that MLA outperforms the random walk and ARIMA forecasts in Bitcoin futures markets, which may have important implications in the decision-making process of predictability.
    Keywords: Cryptocurrency; Bitcoin futures; Machine learning; Covid-19; k-Nearest neighbors; Logistic regression; Naive bayes; Random forest; Support vector machine; Extreme gradient; Boosting
    JEL: C60 E50
    Date: 2020–12–22
  3. By: Turan G. Bali (Georgetown University - Robert Emmett McDonough School of Business); Amit Goyal (University of Lausanne; Swiss Finance Institute); Dashan Huang (Singapore Management University - Lee Kong Chian School of Business); Fuwei Jiang (Central University of Finance and Economics (CUFE)); Quan Wen (Georgetown University - Department of Finance)
    Abstract: We provide a comprehensive study on the cross-sectional predictability of corporate bond returns using big data and machine learning. We examine whether a large set of equity and bond characteristics drive the expected returns on corporate bonds. Using either set of characteristics, we find that machine learning methods substantially improve the out-of-sample predictive power for bond returns, compared to the traditional linear regression models. While equity characteristics produce significant explanatory power for bond returns, their incremental predictive power relative to bond characteristics is economically and statistically insignificant. Bond characteristics provide as strong forecasting power for future equity returns as using equity characteristics alone. However, bond characteristics do not offer additional predictive power above and beyond equity characteristics when we combine both sets of predictors.
    Keywords: machine learning, big data, corporate bond returns, cross-sectional return predictability
    JEL: G10 G11 C13
    Date: 2020–09
  4. By: Ricardo P. Masini; Marcelo C. Medeiros; Eduardo F. Mendes
    Abstract: In this paper we survey the most recent advances in supervised machine learning and high-dimensional models for time series forecasting. We consider both linear and nonlinear alternatives. Among the linear methods we pay special attention to penalized regressions and ensemble of models. The nonlinear methods considered in the paper include shallow and deep neural networks, in their feed-forward and recurrent versions, and tree-based methods, such as random forests and boosted trees. We also consider ensemble and hybrid models by combining ingredients from different alternatives. Tests for superior predictive ability are briefly reviewed. Finally, we discuss application of machine learning in economics and finance and provide an illustration with high-frequency financial data.
    Date: 2020–12
  5. By: Paul Waddell; Arezoo Besharati-Zadeh
    Abstract: Urban transportation and land use models have used theory and statistical modeling methods to develop model systems that are useful in planning applications. Machine learning methods have been considered too 'black box', lacking interpretability, and their use has been limited within the land use and transportation modeling literature. We present a use case in which predictive accuracy is of primary importance, and compare the use of random forest regression to multiple regression using ordinary least squares, to predict rents per square foot in the San Francisco Bay Area using a large volume of rental listings scraped from the Craigslist website. We find that we are able to obtain useful predictions from both models using almost exclusively local accessibility variables, though the predictive accuracy of the random forest model is substantially higher.
    Date: 2020–11
  6. By: Katsuya Ito; Kentaro Minami; Kentaro Imajo; Kei Nakagawa
    Abstract: Investors try to predict returns of financial assets to make successful investments. Many quantitative analysts have used machine learning-based methods to find unknown profitable market rules from large amounts of market data. However, there are several challenges in financial markets hindering practical applications of machine learning-based models. First, in financial markets, there is no single model that can consistently make accurate predictions because traders in markets quickly adapt to newly available information. Instead, there are a number of ephemeral and partially correct models called "alpha factors". Second, since financial markets are highly uncertain, ensuring interpretability of prediction models is quite important to make reliable trading strategies. To overcome these challenges, we propose the Trader-Company method, a novel evolutionary model that mimics the roles of a financial institute and traders belonging to it. Our method predicts future stock returns by aggregating suggestions from multiple weak learners called Traders. A Trader holds a collection of simple mathematical formulae, each of which represents a candidate alpha factor and would be interpretable for real-world investors. The aggregation algorithm, called a Company, maintains multiple Traders. By randomly generating new Traders and retraining them, Companies can efficiently find financially meaningful formulae whilst avoiding overfitting to a transient state of the market. We show the effectiveness of our method by conducting experiments on real market data.
    Date: 2020–12
  7. By: Cyril Verluise (Collège de France); Gabriele Cristelli (Ecole polytechnique federale de Lausanne); Kyle Higham (Hitotsubashi University); Gaetan de Rassenfosse (Ecole polytechnique federale de Lausanne)
    Abstract: Patent citations are one of the most commonly-used metrics in the innovation literature. Leading uses of patent-to-patent citations are associated with the quantification of inventions’ quality and the measurement of knowledge flows. Due to their widespread availability, scholars have exploited citations listed on the front-page of patent documents. Citations appearing in the full-text of patent documents have been neglected. We apply modern machine learning methods to extract these citations from the text of USPTO patent documents. Overall, we are able to recover an additional 15 percent of patent citations that could not be found using only front-page data. We show that in-text citations bring a different type of information compared to front-page citations. They exhibit higher text-similarity to the citing patents and alter the ranking of patent importance. The dataset is available at (CC-BY-4).
    Keywords: Citation; Patent; Open data
    JEL: C81 O30
    Date: 2020–12
  8. By: Riu Naito; Toshihiro Yamada
    Abstract: The paper introduces a very simple and fast computation method for high-dimensional integrals to solve high-dimensional Kolmogorov partial differential equations (PDEs). The new machine learning-based method is obtained by solving a stochastic weighted minimization with stochastic gradient descent which is inspired by a high-order weak approximation scheme for stochastic differential equations (SDEs) with Malliavin weights. Then solutions to high-dimensional Kolmogorov PDEs or expectations of functionals of solutions to high-dimensional SDEs are accurately approximated without suffering from the curse of dimensionality. Numerical examples for PDEs and SDEs up to 100 dimensions are shown by using second and third-order discretization schemes in order to demonstrate the effectiveness of our method.
    Date: 2020–12
  9. By: Breuer, Wolfgang (RWTH Aachen University, Department of Finance, Aachen, Germany); Steininger, Bertram (Department of Real Estate and Construction Management, Royal Institute of Technology)
    Abstract: This paper is organized as follows. In Section 1, we describe the economic relevance of the real estate sector and its recent dynamics. Then, in Section 2, we identify the most mentioned keywords of working papers presented at the real estate conferences between 2015 and 2019 and show network figures for them. In order to identify the newest trends, we rely on working papers since they have an average lead time of at least 1 to 2 years before they are published. In addition, we give a short overview of the articles published in this special issue. To get a better overview of the relevance of real estate related topics in finance, we analyze the most relevant finance conferences and journals between 2015 and May 2020 in Section 3. To find the topics, we apply the text mining approach Latent Dirichlet Allocation (LDA), an unsupervised machine learning method. The real estate trends (retail, e-commerce) and the potential impact of COVID-19 are described in Section 4.
    Keywords: Recent trends; Real estate; Machine learning; Latent Dirichlet Allocation; LDA
    JEL: C45 C80 R30
    Date: 2020–12–28
  10. By: Drydakis, Nick (Anglia Ruskin University)
    Abstract: Using panel data on immigrant populations from European, Asian and African countries the study estimates positive associations between the number of mobile applications in use aiming to facilitate immigrants' societal integration (m-Integration) and increased level of integration (Ethnosizer), good overall health (EQ-VAS) and mental health (CESD-20). It is estimated that the patterns are gender sensitive. In addition, it is found that m-Integration applications in relation to translation and voice assistants, public services, and medical services provide the highest returns on immigrants' level of integration, health/mental health status. For instance, translation and voice assistant applications are associated with a 4% increase in integration and a 0.8% increase in good overall health. Moreover, m-Integration applications aided by artificial intelligence (AI) are associated with increased health/mental health and integration levels among immigrants. We indicate that AI by providing customized search results, peer reviewed e-learning, professional coaching on pronunciation, real-time translations, and virtual communication for finding possible explanations for health conditions might bring better quality services facilitating immigrants' needs. This is the first known study to introduce the term 'm-Integration', quantify associations between applications, health/mental health and integration for immigrants, and assess AI's role in enhancing the aforementioned outcomes.
    Keywords: mobile applications, m-Integration, m-Health, artificial intelligence, integration, immigrants, refugees, health, mental health
    JEL: O3 O31 I1 J15
    Date: 2020–12
  11. By: Olivier Dessaint (INSEAD); Thierry Foucault (HEC Paris - Finance Department); Laurent Frésard (Universita della Svizzera italiana (USI Lugano); Swiss Finance Institute)
    Abstract: We study how data abundance affects the informativeness of financial analysts' forecasts at various horizons. Analysts forecast short-term and long-term earnings and choose how much information to process about each horizon to minimize forecasting error, net of information processing costs. When the cost of obtaining short-term information drops (i.e., more data becomes available), analysts change their information processing strategy in a way that renders their short-term forecasts more informative but that possibly reduces the informativeness of their long-term forecasts. We provide empirical support for this prediction using a large sample of forecasts at various horizons and novel measures of analysts' exposure to abundant data. Data abundance can thus impair the quality of long-term financial forecasts.
    Keywords: Big data, Financial analysts' forecasts, Forecasting horizon, Forecasts' informativeness, Social media
    JEL: D84 G14 G17 M41
    Date: 2020–11
  12. By: Xunyi Wang; Reza Mousavi; Yili Hong
    Abstract: The COVID-19 pandemic has posed an unprecedented challenge to individuals around the globe. To mitigate the spread of the virus, many states in the U.S. issued lockdown orders to urge their residents to stay at their homes, avoid get-togethers, and minimize physical interactions. While many offline workers are experiencing significant challenges performing their duties, digital technologies have provided ample tools for individuals to continue working and to maintain their productivity. Although using digital platforms to build resilience in remote work is effective, other aspects of remote work (beyond the continuation of work) should also be considered in gauging true resilience. In this study, we focus on content creators, and investigate how restrictions on individuals' physical environments affect their online content creation behavior. Exploiting a natural experimental setting wherein four states issued state-wide lockdown orders on the same day whereas five states never issued a lockdown order, and using a unique dataset collected from a short video-sharing social media platform, we study the impact of lockdown orders on content creators' behaviors in terms of content volume, content novelty, and content optimism. We combine econometric methods (difference-in-differences estimations of a matched sample) with machine learning-based natural language processing to show that on average, compared to the users residing in non-lockdown states, the users residing in lockdown states create more content after the lockdown order enforcement. However, we find a decrease in the novelty level and optimism of the content generated by the latter group. Our findings make important contributions to the digital resilience literature and shed light on managers' decision-making process related to the adjustment of employees' work mode in the long run.
    Date: 2020–11
  13. By: Marc Chataigner; Stephane Crepey; Jiang Pu
    Abstract: We devise a neural network based compression/completion methodology for financial nowcasting. The latter is meant in a broad sense encompassing completion of gridded values, interpolation, or outlier detection, in the context of financial time series of curves or surfaces (also applicable in higher dimensions, at least in theory). In particular, we introduce an original architecture amenable to the treatment of data defined at variable grid nodes (by far the most common situation in financial nowcasting applications, so that PCA or classical autoencoder methods are not applicable). This is illustrated by three case studies on real data sets. First, we introduce our approach on repo curves data (with moving time-to-maturity as calendar time passes). Second, we show that our approach outperforms elementary interpolation benchmarks on an equity derivative surfaces data set (with moving time-to-maturity again). We also obtain a satisfying performance for outlier detection and surface completion. Third, we benchmark our approach against PCA on at-the-money swaption surfaces redefined at constant expiry/tenor grid nodes. Our approach is then shown to perform as well as (even if not obviously better than) the PCA which, however, is not applicable to the native, raw data defined on a moving time-to-expiry grid.
    Date: 2020–11
  14. By: Nikita Gusarov (GAEL - Laboratoire d'Economie Appliquée de Grenoble - INRAE - Institut National de Recherche pour l’Agriculture, l’Alimentation et l’Environnement - Grenoble INP - Institut polytechnique de Grenoble - Grenoble Institute of Technology - UGA - Université Grenoble Alpes - CNRS - Centre National de la Recherche Scientifique - UGA - Université Grenoble Alpes); Amirreza Talebijamalabad (Grenoble INP - Institut polytechnique de Grenoble - Grenoble Institute of Technology - UGA - Université Grenoble Alpes); Iragaël Joly (GAEL - Laboratoire d'Economie Appliquée de Grenoble - INRAE - Institut National de Recherche pour l’Agriculture, l’Alimentation et l’Environnement - Grenoble INP - Institut polytechnique de Grenoble - Grenoble Institute of Technology - UGA - Université Grenoble Alpes - CNRS - Centre National de la Recherche Scientifique - UGA - Université Grenoble Alpes)
    Abstract: This work is a cross-disciplinary study of econometrics and machine learning (ML) models applied to consumer choice preference modelling. To bridge the interdisciplinary gap, a simulation and theory-testing framework is proposed. It incorporates all essential steps from hypothetical setting generation to the comparison of various performance metrics. The flexibility of the framework in theory-testing and model comparison over economic and statistical indicators is illustrated based on the work of Michaud, Llerena and Joly (2012). Two datasets are generated using the predefined utility functions simulating the presence of homogeneous and heterogeneous individual preferences for alternatives' attributes. Then, three models issued from econometrics and ML disciplines are estimated and compared. The study demonstrates the proposed methodological approach's efficiency, successfully capturing the differences between the models issued from different fields given the homogeneous or heterogeneous consumer preferences.
    Keywords: Discrete choice models,Neural network analysis,Performance comparison,Heterogeneous preferences
    Date: 2020–10
  15. By: Matteo Sostero (European Commission - JRC)
    Abstract: The service sector is the current technological frontier of automation, thanks to recent advances in artificial intelligence and robotics, raising concerns for the future of work for a large segment of the workforce. This report surveys data on the variety and diffusion of service robots in the EU, in order to describe the state of automation in the service sector. Service robots are tangible artefacts of automation technology in the service sector and are relatively well defined by international standards, which makes it easier to track their diffusion. This report uses different data sources to show that the penetration of service robots is currently relatively low in the European economy, especially when compared to industrial robots. Moreover, service robots are used most often for manual tasks, in parts of the service sector that are most similar to manufacturing, such as logistics, inspection and maintenance, and surface cleaning. After comparing the different definitions and variety of service robots, this report proposes a general taxonomy for automation in the service sector, to guide future research.
    Keywords: Service Robots, Automation, Service Sector, Taxonomy, employment
    Date: 2020–12
  16. By: Cevat Giray Aksoy; Panu Poutvaara; Felicitas Schikora
    Abstract: We study the causal effect of local labor market conditions and attitudes towards immigrants at the time of arrival on refugees’ multi-dimensional integration outcomes (economic, linguistic, navigational, political, psychological, and social). Using a unique dataset on refugees, we leverage a centralized allocation policy in Germany where refugees were exogenously assigned to live in specific counties. We find that high initial local unemployment negatively affects refugees’ economic and social integration: they are less likely to be in education or employment and they earn less. We also show that favorable attitudes towards immigrants promote refugees’ economic and social integration. The results suggest that attitudes toward immigrants are as important as local unemployment rates in shaping refugees’ integration outcomes. Using a machine learning classifier algorithm, we find that our results are driven by older people and those with secondary or tertiary education. Our findings highlight the importance of both initial economic and social conditions for facilitating refugee integration, and have implications for the design of centralized allocation policies.
    Keywords: International migration, refugees, integration, allocation policy
    JEL: F22 J15 J24
    Date: 2020
  17. By: Lundborg, Martin; Märkel, Christian; Schrade-Grytsenko, Lisa; Stamm, Peter
    Abstract: This study examines the use of artificial intelligence in the network sectors and the regulatory questions that arise from it. For the investigation, desk research and expert interviews were conducted from March to November 2019. The results show that there are already many potential fields of application for AI in the telecommunications sector today. So far, however, the German telecommunications market has engaged with only a few selected machine learning/AI applications. This is mainly due to a high demand for skilled workers, know-how and (processed) data, as well as, in places, remaining uncertainty about the benefits of these applications. The main drivers of AI in telecommunications networks are cost savings (OPEX and CAPEX) and resource efficiency, in particular energy efficiency. The primary current field of application for machine learning and AI at telecommunications companies is customer service. For the application of AI in the network sector, the study identified potential discrimination and transparency problems. In addition, AI reinforces economies of scale, resulting in increased potential for market concentration and disruption. Further thematic overlaps were identified that, in light of regulation, require follow-up research. These include, for example, the role of AI in 5G (QoS/network slicing) or in new cloud services.
    Date: 2019

This nep-big issue is ©2021 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at For comments please write to the director of NEP, Marco Novarese at <>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.