nep-big New Economics Papers
on Big Data
Issue of 2022‒06‒27
fifteen papers chosen by
Tom Coupé
University of Canterbury

  1. Nowcasting Growth using Google Trends Data: A Bayesian Structural Time Series Model By Bhattacharjee, Arnab; Kohns, David
  2. How Communication Makes the Difference between a Cartel and Tacit Collusion: A Machine Learning Approach By Maximilian Andres; Lisa Bruttel; Jana Friedrichsen
  3. Research on the correlation between text emotion mining and stock market based on deep learning By Chenrui Zhang
  4. HARNet: A Convolutional Neural Network for Realized Volatility Forecasting By Rafael Reisenhofer; Xandro Bayer; Nikolaus Hautsch
  5. Deep Learning vs. Gradient Boosting: Benchmarking state-of-the-art machine learning algorithms for credit scoring By Marc Schmitt
  6. What constitutes a machine-learning-driven business model? A taxonomy of B2B start-ups with machine learning at their core By Vetter, Oliver A.; Hoffmann, Felix Sebastian; Pumplun, Luisa; Buxmann, Peter
  7. Polytope Fraud Theory By Dongshuai Zhao; Zhongli Wang; Florian Schweizer-Gamborino; Didier Sornette
  8. RLOP: RL Methods in Option Pricing from a Mathematical Perspective By Ziheng Chen
  9. Benchmarking Econometric and Machine Learning Methodologies in Nowcasting By Daniel Hopp
  10. Prescriptive maintenance with causal machine learning By Toon Vanderschueren; Robert Boute; Tim Verdonck; Bart Baesens; Wouter Verbeke
  11. Machine learning in international trade research - evaluating the impact of trade agreements By Holger Breinlich; Valentina Corradi; Nadia Rocha; Michele Ruta; J.M.C. Santos Silva; Tom Zylkin
  12. Mack-Net model: Blending Mack's model with Recurrent Neural Networks By Eduardo Ramos-Pérez; Pablo J. Alonso-González; José Javier Núñez-Velázquez
  13. Neural Optimal Stopping Boundary By A. Max Reppen; H. Mete Soner; Valentin Tissot-Daguette
  14. Researcher reasoning meets computational capacity: Machine learning for social science By Lundberg, Ian; Brand, Jennie E.; Jeon, Nanum
  15. Anticompetitive practices on public procurement: Evidence from Brazilian electronic biddings By Adilson Sampaio; Paulo Figueiredo; Klarizze Puzon

  1. By: Bhattacharjee, Arnab; Kohns, David
    Abstract: This paper investigates the benefits of internet search data in the form of Google Trends for nowcasting real U.S. GDP growth in real time through the lens of mixed frequency Bayesian Structural Time Series (BSTS) models. We augment and enhance both model and methodology to make these more amenable to nowcasting with a large number of potential covariates. Specifically, we allow shrinking state variances towards zero to avoid overfitting, extend the SSVS (spike and slab variable selection) prior to the more flexible normal-inverse-gamma prior which stays agnostic about the underlying model size, as well as adapt the horseshoe prior to the BSTS. The application to nowcasting GDP growth as well as a simulation study demonstrate that the horseshoe prior BSTS improves markedly upon the SSVS and the original BSTS model with the largest gains in dense data-generating-processes. Our application also shows that a large dimensional set of search terms is able to improve nowcasts early in a specific quarter before other macroeconomic data become available. Search terms with high inclusion probability have good economic interpretation, reflecting leading signals of economic anxiety and wealth effects.
    Keywords: global-local priors, Google trends, non-centred state space, shrinkage
    JEL: C11 C22 C55 E37 E66
    Date: 2022–05
  2. By: Maximilian Andres; Lisa Bruttel; Jana Friedrichsen
    Abstract: This paper sheds new light on the role of communication for cartel formation. Using machine learning to evaluate free-form chat communication among firms in a laboratory experiment, we identify typical communication patterns for both explicit cartel formation and indirect attempts to collude tacitly. We document that firms are less likely to communicate explicitly about price fixing and more likely to use indirect messages when sanctioning institutions are present. This effect of sanctions on communication reinforces the direct cartel-deterring effect of sanctions as collusion is more difficult to reach and sustain without an explicit agreement. Indirect messages have no, or even a negative, effect on prices.
    Keywords: cartel, collusion, communication, machine learning, experiment
    JEL: C92 D43 L41
    Date: 2022
  3. By: Chenrui Zhang
    Abstract: This paper discusses how to crawl data from financial forums such as stock bar and conduct sentiment analysis with a deep learning model. It uses the BERT model to train on the financial corpus and predict the Shenzhen stock index. A comparative study of the maximal information coefficient (MIC) finds that the emotional characteristics obtained by applying the BERT model to the financial corpus are reflected in the fluctuation of the stock market, which helps to improve prediction accuracy. At the same time, this paper combines deep learning with financial texts to further explore the impact mechanism of investor sentiment on the stock market, which will help the national regulatory authorities and policy departments to formulate more reasonable policies and guidelines for maintaining the stability of the stock market.
    Date: 2022–05
  4. By: Rafael Reisenhofer; Xandro Bayer; Nikolaus Hautsch
    Abstract: Despite the impressive success of deep neural networks in many application areas, neural network models have so far not been widely adopted in the context of volatility forecasting. In this work, we aim to bridge the conceptual gap between established time series approaches, such as the Heterogeneous Autoregressive (HAR) model, and state-of-the-art deep neural network models. The newly introduced HARNet is based on a hierarchy of dilated convolutional layers, which facilitates an exponential growth of the receptive field of the model in the number of model parameters. HARNets allow for an explicit initialization scheme such that before optimization, a HARNet yields identical predictions as the respective baseline HAR model. Particularly when considering the QLIKE error as a loss function, we find that this approach significantly stabilizes the optimization of HARNets. We evaluate the performance of HARNets with respect to three different stock market indexes. Based on this evaluation, we formulate clear guidelines for the optimization of HARNets and show that HARNets can substantially improve upon the forecasting accuracy of their respective HAR baseline models. In a qualitative analysis of the filter weights learnt by a HARNet, we report clear patterns regarding the predictive power of past information. Among information from the previous week, yesterday and the day before, yesterday's volatility makes by far the largest contribution to today's realized volatility forecast. Moreover, within the previous month, the importance of single weeks diminishes almost linearly when moving further into the past.
    Date: 2022–05
  5. By: Marc Schmitt
    Abstract: Artificial intelligence (AI) and machine learning (ML) have become vital to remain competitive for financial services companies around the globe. The two models currently competing for the pole position in credit risk management are deep learning (DL) and gradient boosting machines (GBM). This paper benchmarked those two algorithms in the context of credit scoring using three distinct datasets with different features to account for the reality that model choice/power is often dependent on the underlying characteristics of the dataset. The experiment has shown that GBM tends to be more powerful than DL and also has the advantage of speed due to lower computational requirements. This makes GBM the winner and choice for credit scoring. However, it was also shown that the outperformance of GBM is not always guaranteed and ultimately the concrete problem scenario or dataset will determine the final model choice. Overall, based on this study both algorithms can be considered state-of-the-art for binary classification tasks on structured datasets, while GBM should be the go-to solution for most problem scenarios due to easier use, significantly faster training time, and superior accuracy.
    Date: 2022–05
  6. By: Vetter, Oliver A.; Hoffmann, Felix Sebastian; Pumplun, Luisa; Buxmann, Peter
    Date: 2022–06–18
  7. By: Dongshuai Zhao (ETH Zürich - Department of Management, Technology, and Economics (D-MTEC)); Zhongli Wang (Bielefeld University); Florian Schweizer-Gamborino (Price Waterhouse Coopers (PwC)); Didier Sornette (ETH Zürich - Department of Management, Technology, and Economics (D-MTEC); Swiss Finance Institute; Southern University of Science and Technology; Tokyo Institute of Technology)
    Abstract: Polytope Fraud Theory (PFT) extends the existing triangle and diamond theories of accounting fraud with ten abnormal financial practice alarms that a fraudulent firm might trigger. These warning signals are identified through evaluation of the shorting behavior of sophisticated activist short sellers, which are used to train several supervised machine-learning methods in detecting financial statement fraud using published accounting data. Our contributions include a systematic manual collection and labeling of companies that are shorted by professional activist short sellers. We also combine well-known asset pricing factors with accounting red flags in financial features selections. Using 80 percent of the data for training and the remaining 20 percent for out-of-sample test and performance assessment, we find that the best method is XGBoost, with a Recall of 79 percent and F1-score of 85 percent. Other methods have only slightly lower performance, demonstrating the robustness of our results. This shows that the sophisticated activist short sellers, from whom the algorithms are learning, have excellent accounting insights, tremendous forensic analytical knowledge, and sharp business acumen. Our feature importance analysis indicates that potential short-selling targets share many similar financial characteristics, such as bankruptcy or financial distress risk, clustering in some industries, inconsistency of profitability, high accrual, and unreasonable business operations. Our results imply the possible automation of advanced financial statement analysis, which can both improve auditing processes and effectively enhance investment performance. Finally, we propose the Unified Investor Protection Framework, summarizing and categorizing investor-protection related theories from the macro-level to the micro-level.
    Keywords: fraud risk assessment, financial fraud, fraud detection, machine learning
    JEL: C45 C53 M40 M41
    Date: 2022–05
  8. By: Ziheng Chen
    Abstract: In this work, we build two environments, namely the modified QLBS and RLOP models, from a mathematical perspective, which enable RL methods in option pricing through portfolio replication. We implement the environment specifications (source code is available), the learning algorithm, and the agent parametrization by a neural network. The learned optimal hedging strategy is compared against the BS prediction. The effect of various factors is considered and studied based on how they affect the optimal price and position.
    Date: 2022–05
  9. By: Daniel Hopp
    Abstract: Nowcasting can play a key role in giving policymakers timelier insight to data published with a significant time lag, such as final GDP figures. Currently, there is a plethora of methodologies and approaches for practitioners to choose from. However, a comprehensive comparison of these disparate approaches in terms of predictive performance and characteristics is lacking. This paper addresses that deficiency by examining the performance of 12 different methodologies in nowcasting US quarterly GDP growth, including all the methods most commonly employed in nowcasting, as well as some of the most popular traditional machine learning approaches. Performance was assessed on three different tumultuous periods in US economic history: the early 1980s recession, the 2008 financial crisis, and the COVID crisis. The two best performing methodologies in the analysis were long short-term memory artificial neural networks (LSTM) and Bayesian vector autoregression (BVAR). To facilitate further application and testing of each of the examined methodologies, an open-source repository containing boilerplate code that can be applied to different datasets is published alongside the paper.
    Date: 2022–05
  10. By: Toon Vanderschueren; Robert Boute; Tim Verdonck; Bart Baesens; Wouter Verbeke
    Abstract: Machine maintenance is a challenging operational problem, where the goal is to plan sufficient preventive maintenance to avoid machine failures and overhauls. Maintenance is often imperfect in reality and does not make the asset as good as new. Although a variety of imperfect maintenance policies have been proposed in the literature, these rely on strong assumptions regarding the effect of maintenance on the machine's condition, assuming the effect is (1) deterministic or governed by a known probability distribution, and (2) machine-independent. This work proposes to relax both assumptions by learning the effect of maintenance conditional on a machine's characteristics from observational data on similar machines using existing methodologies for causal inference. By predicting the maintenance effect, we can estimate the number of overhauls and failures for different levels of maintenance and, consequently, optimize the preventive maintenance frequency to minimize the total estimated cost. We validate our proposed approach using real-life data on more than 4,000 maintenance contracts from an industrial partner. Empirical results show that our novel, causal approach accurately predicts the maintenance effect and results in individualized maintenance schedules that are more accurate and cost-effective than supervised or non-individualized approaches.
    Date: 2022–06
  11. By: Holger Breinlich; Valentina Corradi; Nadia Rocha; Michele Ruta; J.M.C. Santos Silva; Tom Zylkin
    Abstract: Modern trade agreements contain a large number of provisions in addition to tariff reductions, in areas as diverse as services trade, competition policy, trade-related investment measures, or public procurement. Existing research has struggled with overfitting and severe multicollinearity problems when trying to estimate the effects of these provisions on trade flows. Building on recent developments in the machine learning and variable selection literature, this paper proposes data-driven methods for selecting the most important provisions and quantifying their impact on trade flows, without the need of making ad hoc assumptions on how to aggregate individual provisions. The analysis finds that provisions related to antidumping, competition policy, technical barriers to trade, and trade facilitation are associated with enhancing the trade-increasing effect of trade agreements.
    Keywords: lasso, machine learning, preferential trade agreements, deep trade agreements
    Date: 2021–06–16
  12. By: Eduardo Ramos-Pérez; Pablo J. Alonso-González; José Javier Núñez-Velázquez
    Abstract: In general insurance companies, a correct estimation of liabilities plays a key role due to its impact on management and investing decisions. Since the Financial Crisis of 2007-2008 and the strengthening of regulation, the focus is not only on the total reserve but also on its variability, which is an indicator of the risk assumed by the company. Thus, measures that relate profitability with risk are crucial in order to understand the financial position of insurance firms. Taking advantage of the increasing computational power, this paper introduces a stochastic reserving model whose aim is to improve the performance of the traditional Mack's reserving model by applying an ensemble of Recurrent Neural Networks. The results demonstrate that blending traditional reserving models with deep and machine learning techniques leads to a more accurate assessment of general insurance liabilities.
    Date: 2022–05
  13. By: A. Max Reppen; H. Mete Soner; Valentin Tissot-Daguette
    Abstract: A method based on deep artificial neural networks and empirical risk minimization is developed to calculate the boundary separating the stopping and continuation regions in optimal stopping. The algorithm parameterizes the stopping boundary as the graph of a function and introduces relaxed stopping rules based on fuzzy boundaries to facilitate efficient optimization. Several financial instruments, some in high dimensions, are analyzed through this method, demonstrating its effectiveness. The existence of the stopping boundary is also proved under natural structural assumptions.
    Date: 2022–05
  14. By: Lundberg, Ian; Brand, Jennie E. (UCLA); Jeon, Nanum
    Abstract: Computational power and digital data have created new opportunities to explore and understand the social world. A special synergy is possible when social scientists combine human attention to certain aspects of the problem with the power of algorithms to automate other aspects of the problem. We review selected exemplary applications where machine learning amplifies researcher coding, summarizes complex data, relaxes statistical assumptions, and targets researcher attention. We then seek to reduce perceived barriers to machine learning by summarizing several fundamental building blocks and their grounding in classical statistics. We present a few guiding principles and promising approaches where we see particular potential for machine learning to transform social science inquiry. We conclude that machine learning tools are accessible, worthy of attention, and ready to yield new discoveries.
    Date: 2022–05–23
  15. By: Adilson Sampaio; Paulo Figueiredo; Klarizze Puzon
    Abstract: Using big data from the Brazilian public procurement system, this research aims to investigate what factors are associated with the occurrence of anticompetitive practices in electronic bidding. Our analysis considers all services contracted between 2014 and 2017.
    Keywords: Competition, Procurement, Fraud, Brazil
    Date: 2022

This nep-big issue is ©2022 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project is available from the NEP project website. For comments, please write to the director of NEP, Marco Novarese. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.