nep-big New Economics Papers
on Big Data
Issue of 2023‒08‒21
twenty papers chosen by
Tom Coupé, University of Canterbury

  1. For What It's Worth: Measuring Land Value in the Era of Big Data and Machine Learning By Scott Wentland; Gary Cornwall; Jeremy G. Moulton
  2. Comparative Analysis of Machine Learning, Hybrid, and Deep Learning Forecasting Models: Evidence from European Financial Markets and Bitcoins By Apostolos Ampountolas
  3. Stochastic Delay Differential Games: Financial Modeling and Machine Learning Algorithms By Robert Balkin; Hector D. Ceniceros; Ruimeng Hu
  4. Machine learning for option pricing: an empirical investigation of network architectures By Laurens Van Mieghem; Antonis Papapantoleon; Jonas Papazoglou-Hennig
  5. Analysis of the predictor of a volatility surface by machine learning By Valentin Lourme
  6. Panel Data Nowcasting: The Case of Price-Earnings Ratios By Andrii Babii; Ryan T. Ball; Eric Ghysels; Jonas Striaukas
  7. Harnessing the Potential of Volatility: Advancing GDP Prediction By Ali Lashgari
  8. Ideas Without Scale in French Artificial Intelligence Innovations By Johanna Deperi; Ludovic Dibiaggio; Mohamed Keita; Lionel Nesta
  9. Decomposing Climate Risks in Stock Markets By Yuanchen Yang; Chengyu Huang; Yuchen Zhang
  10. Critical comparisons on deep learning approaches for foreign exchange rate prediction By Zhu Bangyuan
  11. Deep Inception Networks: A General End-to-End Framework for Multi-asset Quantitative Strategies By Tom Liu; Stephen Roberts; Stefan Zohren
  12. Patent news shocks help forecast establishment dynamics By Mu-Jeung Yang
  13. Power boost or source of bias? Monte Carlo evidence on ML covariate adjustment in randomized trials in education By Lukas Fervers
  14. Pattern Mining for Anomaly Detection in Graphs: Application to Fraud in Public Procurement By Lucas Potin; Rosa Figueiredo; Vincent Labatut; Christine Largeron
  15. Company2Vec -- German Company Embeddings based on Corporate Websites By Christopher Gerling
  16. Do daily lead texts help nowcasting GDP growth? By Marc Burri
  17. Quantifying priorities in business cycle reports: Analysis of recurring textual patterns around peaks and troughs By Foltas, Alexander
  18. Using GPT-4 for Financial Advice By Christian Fieberg; Lars Hornuf; David J. Streich
  19. Evaluation of Deep Reinforcement Learning Algorithms for Portfolio Optimisation By Chung I Lu
  20. Testing for the Markov property in time series via deep conditional generative learning By Shi, Chengchun

  1. By: Scott Wentland; Gary Cornwall; Jeremy G. Moulton
    Abstract: This paper develops a new method for valuing land, a key asset on a nation’s balance sheet. The method first employs an unsupervised machine learning method, k-means clustering, to discretize unobserved heterogeneity, which we then combine with a supervised learning algorithm, gradient boosted trees (GBT), to obtain property-level price predictions and estimates of the land component. Our initial results from a large national dataset show this approach routinely outperforms hedonic regression methods (as used by the U.K.’s Office for National Statistics, for example) in out-of-sample price predictions. To exploit the best of both methods, we further explore a composite approach using model stacking, finding it outperforms all methods in out-of-sample tests and a benchmark test against nearby vacant land sales. In an application, we value residential, commercial, industrial, and agricultural land for the entire contiguous U.S. from 2006 to 2015. The results offer new insights into valuation and demonstrate how a unified method can build national and subnational estimates of land value from detailed, parcel-level data. We discuss further applications to economic policy and the property valuation literature more generally.
    JEL: E01
    Date: 2023–06
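As a rough illustration of the first stage described in the abstract above, the sketch below runs plain k-means (Lloyd's algorithm) on parcel coordinates and returns one cluster label per parcel, which could then enter a supervised model as a categorical feature. The coordinates, the starting centroids, and the helper name `kmeans` are hypothetical, and this is not the authors' implementation; the gradient boosted trees stage and the stacking step are omitted.

```python
import math


def kmeans(points, centroids, n_iter=20):
    """Plain Lloyd's algorithm. Returns (labels, centroids)."""
    centroids = [tuple(c) for c in centroids]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centroids.append(old)  # an empty cluster keeps its centroid
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return labels, centroids


# Hypothetical parcel coordinates forming two obvious neighbourhoods.
parcels = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
labels, centroids = kmeans(parcels, centroids=[parcels[0], parcels[3]])
```

The resulting `labels` list plays the role of the discretized heterogeneity: each parcel carries a cluster id that a downstream price model can treat like any other categorical input.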
  2. By: Apostolos Ampountolas
    Abstract: This study analyzes the transmission of market uncertainty on key European financial markets and the cryptocurrency market over an extended period, encompassing the pre, during, and post-pandemic periods. Daily financial market indices and price observations are used to assess the forecasting models. We compare statistical, machine learning, and deep learning forecasting models to evaluate the financial markets, such as the ARIMA, hybrid ETS-ANN, and kNN predictive models. The study results indicate that predicting financial market fluctuations is challenging, and the accuracy levels are generally low in several instances. ARIMA and hybrid ETS-ANN models perform better over extended periods compared to the kNN model, with ARIMA being the best-performing model in 2018-2021 and the hybrid ETS-ANN model being the best-performing model in most of the other subperiods. Still, the kNN model outperforms the others in several periods, depending on the observed accuracy measure. Researchers have advocated using parametric and non-parametric modeling combinations to generate better results. In this study, the results suggest that the hybrid ETS-ANN model is the best-performing model despite its moderate level of accuracy. Thus, the hybrid ETS-ANN model is a promising financial time series forecasting approach. The findings offer financial analysts an additional source that can provide valuable insights for investment decisions.
    Date: 2023–07
  3. By: Robert Balkin; Hector D. Ceniceros; Ruimeng Hu
    Abstract: In this paper, we propose a numerical methodology for finding the closed-loop Nash equilibrium of stochastic delay differential games through deep learning. These games are prevalent in finance and economics where multi-agent interaction and delayed effects are often desired features in a model, but are introduced at the expense of increased dimensionality of the problem. This increased dimensionality is especially significant as that arising from the number of players is coupled with the potential infinite dimensionality caused by the delay. Our approach involves parameterizing the controls of each player using distinct recurrent neural networks. These recurrent neural network-based controls are then trained using a modified version of Brown's fictitious play, incorporating deep learning techniques. To evaluate the effectiveness of our methodology, we test it on finance-related problems with known solutions. Furthermore, we also develop new problems and derive their analytical Nash equilibrium solutions, which serve as additional benchmarks for assessing the performance of our proposed deep learning approach.
    Date: 2023–07
  4. By: Laurens Van Mieghem; Antonis Papapantoleon; Jonas Papazoglou-Hennig
    Abstract: We consider the supervised learning problem of learning the price of an option or the implied volatility given appropriate input data (model parameters) and corresponding output data (option prices or implied volatilities). The majority of articles in this literature consider a (plain) feedforward neural network architecture in order to connect the neurons used for learning the function mapping inputs to outputs. In this article, motivated by methods in image classification and recent advances in machine learning methods for PDEs, we investigate empirically whether and how the choice of network architecture affects the accuracy and training time of a machine learning algorithm. We find that for option pricing problems, where we focus on the Black-Scholes and the Heston model, the generalized highway network architecture outperforms all other variants, when considering the mean squared error and the training time as criteria. Moreover, for the computation of the implied volatility, after a necessary transformation, a variant of the DGM architecture outperforms all other variants, when considering again the mean squared error and the training time as criteria.
    Date: 2023–07
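For context, a supervised pricing network of the kind this abstract describes is trained on pairs of model parameters and prices, and such pairs can be generated by a reference pricer. A minimal sketch of a label generator for the Black-Scholes model follows. The function names and the parameter values are illustrative assumptions, and nothing here reproduces the authors' networks or data.

```python
import math


def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def bs_call_price(spot, strike, maturity, rate, vol):
    """Black-Scholes price of a European call on a non-dividend-paying asset."""
    sqrt_t = math.sqrt(maturity)
    d1 = (math.log(spot / strike) + (rate + 0.5 * vol**2) * maturity) / (vol * sqrt_t)
    d2 = d1 - vol * sqrt_t
    return spot * norm_cdf(d1) - strike * math.exp(-rate * maturity) * norm_cdf(d2)


# One hypothetical (inputs, label) training pair.
inputs = (100.0, 100.0, 1.0, 0.05, 0.2)  # spot, strike, maturity, rate, vol
label = bs_call_price(*inputs)
```

Sampling many such input tuples and storing the resulting labels gives a training set on which different architectures could be compared by mean squared error and training time.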
  5. By: Valentin Lourme (Arts et Métiers ParisTech, Natixis)
    Abstract: This study compares two approaches to estimating points on a volatility surface. The first is cubic spline interpolation; the second is a machine learning algorithm, XGBoost. The comparison aims to identify the use cases in which XGBoost is better suited than the cubic spline. The two approaches are compared using the error between the measured volatility and the interpolated or predicted volatility. Cubic spline interpolation requires volatility data from the day of the study, whereas XGBoost is trained on historical data to predict the volatility value on the day of the study.
    Date: 2023–07–05
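The spline side of this comparison can be sketched in a few lines. The example below fits a natural cubic spline through same-day volatility quotes along one slice of a surface and evaluates it between quotes. The strikes and volatilities are made-up numbers, and the natural boundary condition (zero second derivative at both ends) is an assumption of this sketch, not a detail taken from the study.

```python
from bisect import bisect_right


def natural_cubic_spline(xs, ys):
    """Return a function evaluating the natural cubic spline through (xs, ys).

    xs must be strictly increasing and contain at least two points.
    """
    n = len(xs) - 1
    h = [xs[i + 1] - xs[i] for i in range(n)]
    # Right-hand side of the tridiagonal system for the c coefficients.
    alpha = [0.0] * (n + 1)
    for i in range(1, n):
        alpha[i] = 3.0 / h[i] * (ys[i + 1] - ys[i]) - 3.0 / h[i - 1] * (ys[i] - ys[i - 1])
    # Forward sweep of the tridiagonal solve.
    l = [1.0] * (n + 1)
    mu = [0.0] * (n + 1)
    z = [0.0] * (n + 1)
    for i in range(1, n):
        l[i] = 2.0 * (xs[i + 1] - xs[i - 1]) - h[i - 1] * mu[i - 1]
        mu[i] = h[i] / l[i]
        z[i] = (alpha[i] - h[i - 1] * z[i - 1]) / l[i]
    # Back substitution; c[0] = c[n] = 0 is the natural boundary condition.
    b = [0.0] * n
    c = [0.0] * (n + 1)
    d = [0.0] * n
    for j in range(n - 1, -1, -1):
        c[j] = z[j] - mu[j] * c[j + 1]
        b[j] = (ys[j + 1] - ys[j]) / h[j] - h[j] * (c[j + 1] + 2.0 * c[j]) / 3.0
        d[j] = (c[j + 1] - c[j]) / (3.0 * h[j])

    def evaluate(x):
        # Locate the interval containing x, clamping to the end segments.
        j = min(max(bisect_right(xs, x) - 1, 0), n - 1)
        dx = x - xs[j]
        return ys[j] + b[j] * dx + c[j] * dx**2 + d[j] * dx**3

    return evaluate


# Hypothetical same-day quotes: strikes and implied volatilities.
strikes = [80.0, 90.0, 100.0, 110.0, 120.0]
vols = [0.30, 0.25, 0.22, 0.24, 0.28]
smile = natural_cubic_spline(strikes, vols)
vol_at_95 = smile(95.0)  # interpolated volatility between two quoted strikes
```

The spline reproduces every quoted point exactly, which is why it needs quotes from the day of the study. A model trained on historical data has no such requirement.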
  6. By: Andrii Babii; Ryan T. Ball; Eric Ghysels; Jonas Striaukas
    Abstract: The paper uses structured machine learning regressions for nowcasting with panel data consisting of series sampled at different frequencies. Motivated by the problem of predicting corporate earnings for a large cross-section of firms with macroeconomic, financial, and news time series sampled at different frequencies, we focus on the sparse-group LASSO regularization which can take advantage of the mixed frequency time series panel data structures. Our empirical results show the superior performance of our machine learning panel data regression models over analysts' predictions, forecast combinations, firm-specific time series regression models, and standard machine learning methods.
    Date: 2023–07
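To make the regularizer concrete: one common form of the sparse-group LASSO penalty mixes an L1 term, which can zero out individual coefficients, with a group-wise L2 term, which can zero out whole groups (for example, all lags of one high-frequency series). The sketch below only evaluates that penalty for a given coefficient vector. The grouping, the parameter names `lam` and `alpha`, and the absence of group-size weights are assumptions of this illustration, not details taken from the paper.

```python
import math


def sparse_group_lasso_penalty(beta, groups, lam, alpha):
    """lam * (alpha * ||beta||_1 + (1 - alpha) * sum over groups of ||beta_g||_2)."""
    l1_term = sum(abs(b) for b in beta)
    group_term = sum(math.sqrt(sum(beta[i] ** 2 for i in g)) for g in groups)
    return lam * (alpha * l1_term + (1.0 - alpha) * group_term)


# Hypothetical coefficients: the first group is active, the second is all zeros.
beta = [3.0, 4.0, 0.0, 0.0]
groups = [[0, 1], [2, 3]]  # index sets, e.g. the lags of two different series
penalty = sparse_group_lasso_penalty(beta, groups, lam=0.1, alpha=0.5)
```

Setting `alpha=1.0` gives the ordinary LASSO penalty and `alpha=0.0` gives the group LASSO penalty, so `alpha` controls how much sparsity is imposed within groups versus across them.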
  7. By: Ali Lashgari
    Abstract: This paper presents a novel machine learning approach to GDP prediction that incorporates volatility as a model weight. The proposed method is specifically designed to identify and select the most relevant macroeconomic variables for accurate GDP prediction, while taking into account unexpected shocks or events that may impact the economy. The proposed method's effectiveness is tested on real-world data and compared to previous techniques used for GDP forecasting, such as Lasso and Adaptive Lasso. The findings show that the Volatility-weighted Lasso method outperforms other methods in terms of accuracy and robustness, providing policymakers and analysts with a valuable tool for making informed decisions in a rapidly changing economic environment. This study demonstrates how data-driven approaches can help us better understand economic fluctuations and support more effective economic policymaking.
    Keywords: GDP prediction, Lasso, Volatility, Regularization, Macroeconomics Variable Selection, Machine Learning
    JEL: C22 C53 E37
    Date: 2023–06
  8. By: Johanna Deperi (University of Brescia); Ludovic Dibiaggio (SKEMA Business School); Mohamed Keita (SKEMA Business School); Lionel Nesta (GREDEG - Groupe de Recherche en Droit, Economie et Gestion - UNS - Université Nice Sophia Antipolis (1965 - 2019) - COMUE UCA - COMUE Université Côte d'Azur (2015-2019) - CNRS - Centre National de la Recherche Scientifique - UCA - Université Côte d'Azur, OFCE - Observatoire français des conjonctures économiques (Sciences Po) - Sciences Po - Sciences Po)
    Abstract: Artificial intelligence (AI) is viewed as the next technological revolution. The aim of this Policy Brief is to identify France's strengths and weaknesses in this great race for AI innovation. We characterise France's positioning relative to other key players and make the following observations: 1. Without being a world leader in innovation incorporating artificial intelligence, France is showing moderate but significant activity in this field. 2. France specialises in machine learning, unsupervised learning and probabilistic graphical models, and in developing solutions for the medical sciences, transport and security. 3. The AI value chain in France is poorly integrated, mainly due to a lack of integration in the downstream phases of the innovation chain. 4. The limited presence of French private players in the global AI arena contrasts with the extensive involvement of French public institutions. French public research organisations produce patents with great economic value. 5. Public players are the key actors in French networks for collaboration in patent development, but are not open to international and institutional diversity. In our opinion, France runs the risk of becoming a global AI laboratory located upstream in the AI innovation value chain. As such, it is likely to bear the sunk costs of AI invention, without enjoying the benefits of AI exploitation on a larger scale. In short, our fear is that French AI will be exported to other locations to prosper and grow.
    Date: 2023–06–26
  9. By: Yuanchen Yang; Chengyu Huang; Yuchen Zhang
    Abstract: Climate change poses an unprecedented challenge to the world economy and the global financial system. This paper sets out to understand and quantify the impact of climate mitigation, with a focus on climate-related news, which represents an important information source that investors use to revise their subjective assessments of climate risks. Using full-text data from the Financial Times from January 2005 to March 2022, we develop machine learning-based indicators to measure risks from climate mitigation, and the direction of the risk is identified through manual labels. The documented risk premium indicates that climate mitigation news has been partially priced in the Canadian stock market. More specifically, stock prices react positively to market-wide climate-favorable news but they do not react negatively to climate-unfavorable news. The results are robust to different model specifications and across equity markets.
    Keywords: Climate Mitigation; Machine Learning; Asset Pricing
    Date: 2023–06–30
  10. By: Zhu Bangyuan
    Abstract: In a natural market environment, a price-prediction model needs to be updated in real time as the system obtains new data, so that its predictions remain accurate. To improve the user experience of the system, the price-prediction function should use the network that trains fastest and fits the data best as its predictive model. We conduct research on the fundamental theories of RNN, LSTM, and BP neural networks, analyse their respective characteristics, and discuss their advantages and disadvantages to provide a reference for the selection of price-prediction models.
    Date: 2023–07
  11. By: Tom Liu; Stephen Roberts; Stefan Zohren
    Abstract: We introduce Deep Inception Networks (DINs), a family of Deep Learning models that provide a general framework for end-to-end systematic trading strategies. DINs extract time series (TS) and cross sectional (CS) features directly from daily price returns. This removes the need for handcrafted features, and allows the model to learn from TS and CS information simultaneously. DINs benefit from a fully data-driven approach to feature extraction, whilst avoiding overfitting. Extending prior work on Deep Momentum Networks, DIN models directly output position sizes that optimise Sharpe ratio, but for the entire portfolio instead of individual assets. We propose a novel loss term to balance turnover regularisation against increased systemic risk from high correlation to the overall market. Using futures data, we show that DIN models outperform traditional TS and CS benchmarks, are robust to a range of transaction costs and perform consistently across random seeds. To balance the general nature of DIN models, we provide examples of how attention and Variable Selection Networks can aid the interpretability of investment decisions. These model-specific methods are particularly useful when the dimensionality of the input is high and variable importance fluctuates dynamically over time. Finally, we compare the performance of DIN models on other asset classes, and show how the space of potential features can be customised.
    Date: 2023–07
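The idea of optimising a portfolio-level Sharpe ratio can be illustrated with a small helper that turns position sizes and asset returns into an annualised Sharpe ratio, whose negative would serve as a loss. The data, the 252-day annualisation factor, and the function names are assumptions made for illustration. The turnover and correlation terms mentioned in the abstract are not modelled, and no neural network is involved.

```python
import math
from statistics import mean, pstdev


def portfolio_returns(positions, asset_returns):
    """Per-period portfolio return: sum over assets of position * asset return."""
    return [
        sum(w * r for w, r in zip(weights, rets))
        for weights, rets in zip(positions, asset_returns)
    ]


def annualised_sharpe(returns, periods_per_year=252):
    """Mean over standard deviation of per-period returns, scaled to a year."""
    sd = pstdev(returns)
    if sd == 0.0:
        raise ValueError("Sharpe ratio is undefined for constant returns")
    return mean(returns) / sd * math.sqrt(periods_per_year)


def sharpe_loss(positions, asset_returns):
    """Negative portfolio Sharpe ratio, so that minimising it maximises Sharpe."""
    return -annualised_sharpe(portfolio_returns(positions, asset_returns))


# Hypothetical data: four days, two assets, constant equal positions.
asset_returns = [[0.01, -0.02], [0.00, 0.01], [0.02, 0.00], [-0.01, 0.01]]
positions = [[0.5, 0.5]] * 4
loss = sharpe_loss(positions, asset_returns)
```

Because the ratio is computed on the combined return series, correlations between assets affect the loss, which is the point of scoring the whole portfolio and not each asset separately.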
  12. By: Mu-Jeung Yang
    Abstract: Several ongoing survey programs by the US Census Bureau sample establishments based on forecasted size. Current Census practice relies mainly on past size as a predictor of future size. This project uses responsiveness to patent news shocks as additional forecast variables for establishment size and evaluates these variables by their out-of-sample prediction performance using machine learning.
    Date: 2023–07
  13. By: Lukas Fervers (University of Cologne and Leibniz-Centre for Life-Long Learning)
    Abstract: Statistical theory makes ambiguous predictions about covariate adjustment in randomized trials. While proponents highlight possible efficiency gains, opponents point to possible finite-sample bias, a loss of precision in the case of many and weak covariates, and the increasing danger of false-positive results due to repeated model specification. This theoretical reasoning suggests that machine learning (variable selection) methods may be promising tools to keep the advantages of covariate adjustment, while simultaneously protecting against its downsides. In this presentation, I rely on recent developments of machine learning methods for causal effects and their implementation in Stata to assess the performance of ML methods in randomized trials. I rely on real-world data and simulate treatment effects on a wide range of different data structures, including different outcomes and sample sizes. (Preliminary) results suggest that ML-adjusted estimates are unbiased and show considerable efficiency gains compared with unadjusted analysis. The results are fairly similar across the different data structures used and robust to the choice of tuning parameters of the ML estimators. These results tend to support the more optimistic view on covariate adjustment and highlight the potential of ML methods in this field.
    Date: 2023–06–15
  14. By: Lucas Potin (LIA - Laboratoire Informatique d'Avignon - AU - Avignon Université - Centre d'Enseignement et de Recherche en Informatique - CERI); Rosa Figueiredo (LIA - Laboratoire Informatique d'Avignon - AU - Avignon Université - Centre d'Enseignement et de Recherche en Informatique - CERI); Vincent Labatut (LIA - Laboratoire Informatique d'Avignon - AU - Avignon Université - Centre d'Enseignement et de Recherche en Informatique - CERI); Christine Largeron (LHC - Laboratoire Hubert Curien - IOGS - Institut d'Optique Graduate School - UJM - Université Jean Monnet - Saint-Étienne - CNRS - Centre National de la Recherche Scientifique)
    Abstract: In the context of public procurement, several indicators called red flags are used to estimate fraud risk. They are computed according to certain contract attributes and are therefore dependent on the proper filling of the contract and award notices. However, these attributes are very often missing in practice, which prevents the computation of red flags. Traditional fraud detection approaches focus on tabular data only, considering each contract separately, and are therefore very sensitive to this issue. In this work, we adopt a graph-based method that leverages relations between contracts to compensate for the missing attributes. We propose PANG (Pattern-Based Anomaly Detection in Graphs), a general supervised framework relying on pattern extraction to detect anomalous graphs in a collection of attributed graphs. Notably, it is able to identify induced subgraphs, a type of pattern widely overlooked in the literature. When benchmarked on standard datasets, its predictive performance is on par with state-of-the-art methods, with the additional advantage of being explainable. These experiments also reveal that induced patterns are more discriminative on certain datasets. When applying PANG to public procurement data, the prediction is superior to other methods, and it identifies subgraph patterns that are characteristic of fraud-prone situations, thereby making it possible to better understand fraudulent behavior.
    Keywords: Pattern Mining, Graph Classification, Public Procurement, Fraud Detection
    Date: 2023–09–18
  15. By: Christopher Gerling
    Abstract: With Company2Vec, the paper proposes a novel application in representation learning. The model analyzes business activities from unstructured company website data using Word2Vec and dimensionality reduction. Company2Vec maintains semantic language structures and thus creates efficient company embeddings in fine-granular industries. These semantic embeddings can be used for various applications in banking. Direct relations between companies and words allow semantic business analytics (e.g. top-n words for a company). Furthermore, industry prediction is presented as a supervised learning application and evaluation method. The vectorized structure of the embeddings allows measuring similarities between companies with the cosine distance. Company2Vec hence offers a more fine-grained comparison of companies than the standard industry labels (NACE). This property is relevant for unsupervised learning tasks, such as clustering. An alternative industry segmentation is shown with k-means clustering on the company embeddings. Finally, this paper proposes three algorithms for (1) firm-centric, (2) industry-centric and (3) portfolio-centric peer-firm identification.
    Date: 2023–07
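The similarity measure mentioned in this abstract is simple to state in code. The sketch below computes cosine similarity (one minus the cosine distance) between embedding vectors and ranks the closest peers of a target company. The three-dimensional vectors, the company names, and the helper `nearest_peers` are invented for illustration; they are not the paper's embeddings or its peer-identification algorithms.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; cosine distance is 1 minus this."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def nearest_peers(target, embeddings, top_n=2):
    """Rank all other companies by cosine similarity to the target company."""
    scores = [
        (name, cosine_similarity(embeddings[target], vec))
        for name, vec in embeddings.items()
        if name != target
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:top_n]


# Hypothetical low-dimensional company embeddings.
embeddings = {
    "bakery_a": [0.9, 0.1, 0.0],
    "bakery_b": [0.8, 0.2, 0.1],
    "law_firm": [0.0, 0.1, 0.9],
}
peers = nearest_peers("bakery_a", embeddings, top_n=1)
```

Because the measure depends only on the angle between vectors, two companies with similar vocabulary profiles score close to 1 regardless of how much text their websites contain.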
  16. By: Marc Burri
    Abstract: This paper evaluates whether publicly available daily news lead texts help nowcasting Swiss GDP growth. I collect titles and lead texts from three Swiss newspapers and calculate text-based indicators for various economic concepts. A composite indicator calculated from these indicators is highly correlated with low-frequency macroeconomic data and survey-based indicators. In a pseudo out-of-sample nowcasting exercise for Swiss GDP growth, the indicator outperforms a monthly Swiss business cycle indicator if one month of information is available. Improvements in nowcasting accuracy mainly occur in times of economic distress.
    Keywords: Mixed-frequency data, composite leading indicator, news sentiment, recession, natural language processing, nowcasting
    JEL: C53 E32 E37
    Date: 2023–07
  17. By: Foltas, Alexander
    Abstract: I propose a novel approach to uncover business cycle reports' priorities and relate them to economic fluctuations. To this end, I leverage quantitative business-cycle forecasts published by leading German economic research institutes since 1970 to estimate the proportions of latent topics in associated business cycle reports. I then employ a supervised approach to aggregate topics with similar themes, thus revealing the proportions of broader macroeconomic subjects. I obtain measures of forecasters' subject priorities by extracting the subject proportions' cyclic components. Correlating these priorities with key macroeconomic variables reveals consistent priority patterns throughout economic peaks and troughs. The forecasters prioritize inflation-related matters over recession-related considerations around peaks. This finding suggests that forecasters underestimate growth and overestimate inflation risks during contractive monetary policies, which might explain their failure to predict recessions. Around troughs, forecasters prioritize investment matters, potentially suggesting a better understanding of macroeconomic developments during those periods compared to peaks.
    Keywords: Macroeconomic forecasting, Evaluating forecasts, Business cycles, Recession forecasting, Topic Modeling, Natural language processing, Machine learning, Judgemental forecasting
    JEL: E32 E37 C45
    Date: 2023
  18. By: Christian Fieberg; Lars Hornuf; David J. Streich
    Abstract: We show that the recently released text-based artificial intelligence tool GPT-4 can provide suitable financial advice. The tool suggests specific investment portfolios that reflect an investor’s individual circumstances such as risk tolerance, risk capacity, and sustainability preference. Notably, while the suggested portfolios display home bias and are rather insensitive to the investment horizon, historical risk-adjusted performance is on par with a professionally managed benchmark portfolio. Given the current inability of GPT-4 to provide full-service financial advice, it may be used by financial advisors as a back-office tool for portfolio recommendation.
    Keywords: GPT-4, ChatGPT, financial advice, artificial intelligence, portfolio management
    JEL: G00 G11
    Date: 2023
  19. By: Chung I Lu
    Abstract: We evaluate benchmark deep reinforcement learning (DRL) algorithms on the task of portfolio optimisation under a simulator. The simulator is based on correlated geometric Brownian motion (GBM) with the Bertsimas-Lo (BL) market impact model. Using the Kelly criterion (log utility) as the objective, we can analytically derive the optimal policy without market impact and use it as an upper bound to measure performance when including market impact. We found that the off-policy algorithms DDPG, TD3 and SAC were unable to learn the right Q function due to the noisy rewards and therefore perform poorly. The on-policy algorithms PPO and A2C, with the use of generalised advantage estimation (GAE), were able to deal with the noise and derive a close to optimal policy. The clipping variant of PPO was found to be important in preventing the policy from deviating from the optimal once converged. In a more challenging environment where we have regime changes in the GBM parameters, we found that PPO, combined with a hidden Markov model (HMM) to learn and predict the regime context, is able to learn different policies adapted to each regime. Overall, we find that the sample complexity of these algorithms is too high, requiring more than 2 million steps to learn a good policy in the simplest setting, which is equivalent to almost 8,000 years of daily prices.
    Date: 2023–07
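Two pieces of arithmetic behind this abstract can be checked directly. For a single asset following GBM with drift mu and volatility sigma, and a risk-free rate r, the log-utility (Kelly) optimal fraction without market impact is (mu - r) / sigma^2. Separately, converting training steps to calendar time at an assumed 252 trading days per year shows why 2 million daily steps corresponds to close to 8,000 years of data. The parameter values below are hypothetical, and the single-asset case is a simplification of the correlated multi-asset setting in the paper.

```python
def kelly_fraction(mu, sigma, rate=0.0):
    """Log-utility optimal fraction of wealth in one GBM asset, no market impact."""
    if sigma <= 0.0:
        raise ValueError("sigma must be positive")
    return (mu - rate) / sigma**2


def steps_to_years(n_steps, steps_per_year=252):
    """Convert a number of daily steps to years (252 trading days is an assumption)."""
    return n_steps / steps_per_year


# Hypothetical parameters: 8% drift, 20% volatility, 2% risk-free rate.
fraction = kelly_fraction(mu=0.08, sigma=0.2, rate=0.02)
years = steps_to_years(2_000_000)
```

With these inputs the optimal fraction exceeds one, meaning the log-utility investor would use leverage. An analytic benchmark of this kind is what the abstract refers to as the upper bound against which learned policies are measured.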
  20. By: Shi, Chengchun
    Abstract: The Markov property is widely imposed in the analysis of time series data. Correspondingly, testing the Markov property, and relatedly, inferring the order of a Markov model, are of paramount importance. In this article, we propose a nonparametric test for the Markov property in high-dimensional time series via deep conditional generative learning. We also apply the test sequentially to determine the order of the Markov model. We show that the test controls the type-I error asymptotically and has power approaching one. Our proposal makes novel contributions in several ways. We utilize and extend state-of-the-art deep generative learning to estimate the conditional density functions, and establish a sharp upper bound on the approximation error of the estimators. We derive a doubly robust test statistic, which employs a nonparametric estimation but achieves a parametric convergence rate. We further adopt sample splitting and cross-fitting to minimize the conditions required to ensure the consistency of the test. We demonstrate the efficacy of the test through both simulations and three data applications.
    Keywords: deep conditional generative learning; high-dimensional time series; hypothesis testing; Markov property; mixture density network
    JEL: C1
    Date: 2023–06–23

This nep-big issue is ©2023 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at For comments please write to the director of NEP, Marco Novarese at <>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.