nep-big New Economics Papers
on Big Data
Issue of 2023‒09‒25
nineteen papers chosen by
Tom Coupé, University of Canterbury

  1. Applying Machine Learning Algorithms to Predict the Size of the Informal Economy By Joao Felix; Michel Alexandre; Gilberto Tadeu Lima
  2. Can Machine Learning Catch Economic Recessions Using Economic and Market Sentiments? By Kian Tehranian
  3. D-TIPO: Deep time-inconsistent portfolio optimization with stocks and options By Kristoffer Andersson; Cornelis W. Oosterlee
  4. Forecasting inflation using disaggregates and machine learning By Gilberto Boaretto; Marcelo C. Medeiros
  5. Support for Stock Trend Prediction Using Transformers and Sentiment Analysis By Harsimrat Kaeley; Ye QIAO; Nader BAGHERZADEH
  6. Designing an attack-defense game: how to increase robustness of financial transaction models via a competition By Alexey Zaytsev; Alex Natekin; Evgeni Vorsin; Valerii Smirnov; Oleg Sidorshin; Alexander Senin; Alexander Dudin; Dmitry Berestnev
  7. Model-agnostic auditing: a lost cause? By Hansen, Sakina; Loftus, Joshua
  8. Learning to Learn Financial Networks for Optimising Momentum Strategies By Xingyue Pu; Stefan Zohren; Stephen Roberts; Xiaowen Dong
  9. Retail Demand Forecasting: A Comparative Study for Multivariate Time Series By Md Sabbirul Haque; Md Shahedul Amin; Jonayet Miah
  10. American Stories: A Large-Scale Structured Text Dataset of Historical U.S. Newspapers By Melissa Dell; Jacob Carlson; Tom Bryan; Emily Silcock; Abhishek Arora; Zejiang Shen; Luca D'Amico-Wong; Quan Le; Pablo Querubin; Leander Heldring
  11. Decode China's Economic Engagement in Africa: Evolving Policies, Investment and Trade Trends, and Implications By Zhang, Yuhan; Mekonnen, Shimelse
  12. Effects of Daily News Sentiment on Stock Price Forecasting By S. Srinivas; R. Gadela; R. Sabu; A. Das; G. Nath; V. Datla
  13. Analysis of CBDC Narrative of Central Banks using Large Language Models By Andres Alonso-Robisco; Jose Manuel Carbo
  14. Discrimination and Constraints: Evidence from The Voice By Anuar Assamidanov
  15. Do We Price Happiness? Evidence from Korean Stock Market By HyeonJun Kim
  16. Sources of economic policy uncertainty in the euro area: a ready-to-use database By Andrés Azqueta-Gavaldón; Marina Diakonova; Corinna Ghirelli; Javier J. Pérez
  17. Linking microblogging sentiments to stock price movement: An application of GPT-4 By Rick Steinert; Saskia Altmann
  18. To the Moon: Analyzing Collective Trading Events on the Wings of Sentiment Analysis By Tim Matthies; Thomas Löhden; Stephan Leible; Jun-Patrick Raabe
  19. NLP-based detection of systematic anomalies among the narratives of consumer complaints By Peiheng Gao; Ning Sun; Xuefeng Wang; Chen Yang; Ričardas Zitikis

  1. By: Joao Felix; Michel Alexandre; Gilberto Tadeu Lima
    Abstract: The use of machine learning models and techniques to predict economic variables has been growing lately, motivated by their better performance when compared to that of linear models. Although linear models have the advantage of considerable interpretive power, efforts have intensified in recent years to make machine learning models more interpretable. In this paper, tests are conducted to determine whether models based on machine learning algorithms have better performance relative to that of linear models for predicting the size of the informal economy. The paper also explores whether the determinants of such size detected as the most important by machine learning models are the same as those detected in the literature based on traditional linear models. For this purpose, observations were collected and processed for 122 countries from 2004 to 2014. Next, eleven models (four linear and seven based on machine learning algorithms) were used to predict the size of the informal economy in these countries. The relative importance of the predictive variables in determining the results yielded by the machine learning algorithms was calculated using Shapley values. The results suggest that (i) models based on machine learning algorithms have better predictive performance than that of linear models and (ii) the main determinants detected through the Shapley values coincide with those detected in the literature using traditional linear models.
    Keywords: Informal economy; machine learning; linear models; Shapley values
    JEL: C52 C53 O17
    Date: 2023–08–28
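As a rough illustration of the Shapley-value attribution mentioned in this abstract, the sketch below computes exact Shapley values by averaging each feature's marginal contribution over all feature orderings. It is not the authors' implementation: the feature names (`tax_burden`, `unemployment`) and the scores in `TOY_SCORES` are hypothetical, and exact enumeration is only practical for a handful of features.

```python
from itertools import permutations


def shapley_values(value_fn, features):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        coalition = frozenset()
        for f in order:
            with_f = coalition | {f}
            # Marginal gain from adding feature f to the current coalition.
            totals[f] += value_fn(with_f) - value_fn(coalition)
            coalition = with_f
    return {f: total / len(orderings) for f, total in totals.items()}


# Hypothetical predictive score of a model fitted on each feature subset.
TOY_SCORES = {
    frozenset(): 0.0,
    frozenset({"tax_burden"}): 0.30,
    frozenset({"unemployment"}): 0.20,
    frozenset({"tax_burden", "unemployment"}): 0.60,
}


def toy_value(coalition):
    """Look up the (made-up) score for a coalition of features."""
    return TOY_SCORES[frozenset(coalition)]


phi = shapley_values(toy_value, ["tax_burden", "unemployment"])
print(phi)
```

By construction the attributions sum to the score of the full feature set, which is what makes them usable as a relative-importance measure.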
  2. By: Kian Tehranian
    Abstract: Quantitative models are an important decision-making factor for policy makers and investors. Predicting an economic recession with high accuracy and reliability would be very beneficial for society. This paper assesses machine learning techniques for predicting economic recessions in the United States using market sentiment and economic indicators (seventy-five explanatory variables) at monthly frequency from January 1986 to June 2022. To address missing time-series data points, the Autoregressive Integrated Moving Average (ARIMA) method is used to backcast explanatory variables. The analysis starts by reducing the high-dimensional dataset to only its most important characteristics using the Boruta algorithm and a correlation matrix, and by resolving multicollinearity. Various cross-validated models, both probability regression methods and machine learning techniques, are then built to predict the binary recession outcome. The methods considered are Probit, Logit, Elastic Net, Random Forest, Gradient Boosting, and Neural Network. Lastly, the performance of the different models is discussed based on the confusion matrix, accuracy, and F1 score, with potential reasons for their weaknesses and robustness.
    Date: 2023–08
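The evaluation metrics named in this abstract (confusion matrix, accuracy, F1 score) can be sketched in a few lines for a binary recession indicator. The labels below are invented for illustration and do not come from the paper; 1 is assumed to mean "recession month".

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 = recession."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    """Share of months classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall; 0.0 if there are no positives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical monthly labels: actual recession months vs. model predictions.
y_true = [0, 0, 0, 0, 1, 1, 0, 0, 1, 0]
y_pred = [0, 0, 0, 1, 1, 0, 0, 0, 1, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
acc = accuracy(tp, fp, fn, tn)
f1 = f1_score(tp, fp, fn)
print(tp, fp, fn, tn, acc, f1)
```

Because recessions are rare, accuracy alone can look high for a model that never predicts one, which is presumably why the F1 score is reported alongside it.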
  3. By: Kristoffer Andersson; Cornelis W. Oosterlee
    Abstract: In this paper, we propose a machine learning algorithm for time-inconsistent portfolio optimization. The proposed algorithm builds upon neural network based trading schemes, in which the asset allocation at each time point is determined by a neural network. The loss function is given by an empirical version of the objective function of the portfolio optimization problem. Moreover, various trading constraints are naturally fulfilled by choosing appropriate activation functions in the output layers of the neural networks. Besides this, our main contribution is to add options to the portfolio of risky assets and a risk-free bond and to use additional neural networks to determine the amount allocated into the options as well as their strike prices. We consider objective functions more in line with the rational preference of an investor than the classical mean-variance, apply realistic trading constraints and model the assets with a correlated jump-diffusion SDE. With an incomplete market and a more involved objective function, we show that it is beneficial to add options to the portfolio. Moreover, it is shown that adding options leads to a more constant stock allocation with less demand for drastic re-allocations.
    Date: 2023–08
  4. By: Gilberto Boaretto; Marcelo C. Medeiros
    Abstract: This paper examines the effectiveness of several forecasting methods for predicting inflation, focusing on aggregating disaggregated forecasts - also known in the literature as the bottom-up approach. Taking the Brazilian case as an application, we consider different disaggregation levels for inflation and employ a range of traditional time series techniques as well as linear and nonlinear machine learning (ML) models to deal with a larger number of predictors. For many forecast horizons, the aggregation of disaggregated forecasts performs just as well as survey-based expectations and models that generate forecasts using the aggregate directly. Overall, ML methods outperform traditional time series models in predictive accuracy, with outstanding performance in forecasting disaggregates. Our results reinforce the benefits of using models in a data-rich environment for inflation forecasting, including aggregating disaggregated forecasts from ML techniques, mainly during volatile periods. Starting from the COVID-19 pandemic, the random forest model based on both aggregate and disaggregated inflation achieves remarkable predictive performance at intermediate and longer horizons.
    Date: 2023–08
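A minimal sketch of the bottom-up approach described in this abstract: forecast each component of the price index separately, then combine the component forecasts with their index weights. The component names, weights, and forecast values below are hypothetical, and the paper's actual aggregation scheme may differ.

```python
def bottom_up_forecast(component_forecasts, weights):
    """Weighted average of component forecasts (weights need not sum to 1)."""
    if set(component_forecasts) != set(weights):
        raise ValueError("forecasts and weights must cover the same components")
    total_weight = sum(weights.values())
    weighted_sum = sum(weights[k] * component_forecasts[k] for k in weights)
    return weighted_sum / total_weight


# Hypothetical monthly inflation forecasts (percent) for three disaggregates.
forecasts = {"food": 0.8, "services": 0.5, "goods": 0.2}
# Hypothetical expenditure weights of each disaggregate in the headline index.
weights = {"food": 0.25, "services": 0.45, "goods": 0.30}

headline = bottom_up_forecast(forecasts, weights)
print(round(headline, 3))
```

The alternative the abstract compares against is forecasting the headline series directly, without the intermediate component models.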
  5. By: Harsimrat Kaeley (University of California, Irvine); Ye QIAO (University of California, Irvine); Nader BAGHERZADEH (University of California, Irvine)
    Abstract: Stock trend analysis has been an influential time-series prediction topic due to its lucrative and inherently chaotic nature. Many models looking to accurately predict the trend of stocks have been based on Recurrent Neural Networks (RNNs). However, due to the limitations of RNNs, such as vanishing gradients and long-term dependencies being lost as sequence length increases, in this paper we develop a Transformer-based model that uses technical stock data and sentiment analysis to conduct accurate stock trend prediction over long time windows. This paper also introduces a novel dataset containing daily technical stock data and top news headline data spanning almost three years. Stock prediction based solely on technical data can suffer from lag caused by the inability of stock indicators to effectively factor in breaking market news. The use of sentiment analysis on top headlines can help account for unforeseen shifts in market conditions caused by news coverage. We measure the performance of our model against RNNs over sequence lengths spanning 5 business days to 30 business days to mimic different length trading strategies. This reveals an improvement in directional accuracy over RNNs as sequence length is increased, with the largest improvement being close to 18.63% at 30 business days.
    Keywords: Stock Prediction, Machine Learning, Recurrent Neural Network, LSTM, Transformer, Self Attention, Sentiment Analysis, Technical Analysis
    JEL: C32 C35 E37
  6. By: Alexey Zaytsev; Alex Natekin; Evgeni Vorsin; Valerii Smirnov; Oleg Sidorshin; Alexander Senin; Alexander Dudin; Dmitry Berestnev
    Abstract: Given the escalating risks of malicious attacks in the finance sector and the consequential severe damage, a thorough understanding of adversarial strategies and robust defense mechanisms for machine learning models is critical. The threat becomes even more severe with banks' increased adoption of more accurate but potentially fragile neural networks. We aim to investigate the current state and dynamics of adversarial attacks and defenses for neural network models that use sequential financial data as the input. To achieve this goal, we have designed a competition that allows realistic and detailed investigation of problems in modern financial transaction data. The participants compete directly against each other, so possible attacks and defenses are examined in close-to-real-life conditions. Our main contributions are an analysis of the competition dynamics that answers the questions of how important it is to conceal a model from malicious users, how long it takes to break it, and what techniques one should use to make it more robust, and the introduction of additional ways to attack models or increase their robustness. Our analysis continues with a meta-study of the approaches used and their power, numerical experiments, and accompanying ablation studies. We show that the developed attacks and defenses outperform existing alternatives from the literature while being practical in terms of execution, proving the validity of the competition as a tool for uncovering vulnerabilities of machine learning models and mitigating them in various domains.
    Date: 2023–08
  7. By: Hansen, Sakina; Loftus, Joshua
    Abstract: Tools for interpretable machine learning (IML) or explainable artificial intelligence (xAI) can be used to audit algorithms for fairness or other desiderata. In a black-box setting without access to the algorithm’s internal structure an auditor may be limited to methods that are model-agnostic. These methods have severe limitations with important consequences for outcomes such as fairness. Among model-agnostic IML methods, visualizations such as the partial dependence plot (PDP) or individual conditional expectation (ICE) plots are popular and useful for displaying qualitative relationships. Although we focus on fairness auditing with PDP/ICE plots, the consequences we highlight generalize to other auditing or IML/xAI applications. This paper questions the validity of auditing in high-stakes settings with contested values or conflicting interests if the audit methods are model-agnostic.
    Keywords: artificial intelligence; black-box auditing; causal models; CEUR Workshop Proceedings; counterfactual fairness; individual conditional expectation; machine learning; partial dependence plots; supervised learning; visualization
    JEL: C1
    Date: 2023–07–16
  8. By: Xingyue Pu; Stefan Zohren; Stephen Roberts; Xiaowen Dong
    Abstract: Network momentum provides a novel type of risk premium, which exploits the interconnections among assets in a financial network to predict future returns. However, the current process of constructing financial networks relies heavily on expensive databases and financial expertise, limiting accessibility for small-sized and academic institutions. Furthermore, the traditional approach treats network construction and portfolio optimisation as separate tasks, potentially hindering optimal portfolio performance. To address these challenges, we propose L2GMOM, an end-to-end machine learning framework that simultaneously learns financial networks and optimises trading signals for network momentum strategies. The model of L2GMOM is a neural network with a highly interpretable forward propagation architecture, which is derived from algorithm unrolling. L2GMOM is flexible and can be trained with diverse loss functions for portfolio performance, e.g. the negative Sharpe ratio. Backtesting on 64 continuous futures contracts demonstrates a significant improvement in portfolio profitability and risk control, with a Sharpe ratio of 1.74 across a 20-year period.
    Date: 2023–08
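To make the "negative Sharpe ratio" loss mentioned in this abstract concrete, here is a small sketch that turns a series of portfolio returns into a quantity to be minimised. It is only an illustration: the annualisation factor of 252, the zero risk-free rate, and the use of the sample standard deviation are assumptions, and a real training loop would compute this inside a differentiable framework rather than plain Python.

```python
import math


def negative_sharpe(returns, periods_per_year=252):
    """Negative annualised Sharpe ratio (risk-free rate assumed to be zero)."""
    n = len(returns)
    if n < 2:
        raise ValueError("need at least two returns")
    mean = sum(returns) / n
    # Sample variance with the n - 1 denominator.
    variance = sum((r - mean) ** 2 for r in returns) / (n - 1)
    std = math.sqrt(variance)
    if std == 0:
        raise ValueError("returns have zero volatility")
    return -(mean / std) * math.sqrt(periods_per_year)


# Hypothetical daily portfolio returns.
daily_returns = [0.01, -0.005, 0.02, 0.0, 0.015]
loss = negative_sharpe(daily_returns)
print(loss)
```

Minimising this loss rewards higher mean return per unit of volatility, and the value is unchanged if every return is scaled by the same positive constant (for example, by leverage).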
  9. By: Md Sabbirul Haque; Md Shahedul Amin; Jonayet Miah
    Abstract: Accurate demand forecasting in the retail industry is a critical determinant of financial performance and supply chain efficiency. As global markets become increasingly interconnected, businesses are turning towards advanced prediction models to gain a competitive edge. However, existing literature mostly focuses on historical sales data and ignores the vital influence of macroeconomic conditions on consumer spending behavior. In this study, we bridge this gap by enriching time series data of customer demand with macroeconomic variables, such as the Consumer Price Index (CPI), Index of Consumer Sentiment (ICS), and unemployment rates. Leveraging this comprehensive dataset, we develop and compare various regression and machine learning models to predict retail demand accurately.
    Date: 2023–08
  10. By: Melissa Dell; Jacob Carlson; Tom Bryan; Emily Silcock; Abhishek Arora; Zejiang Shen; Luca D'Amico-Wong; Quan Le; Pablo Querubin; Leander Heldring
    Abstract: Existing full text datasets of U.S. public domain newspapers do not recognize the often complex layouts of newspaper scans, and as a result the digitized content scrambles texts from articles, headlines, captions, advertisements, and other layout regions. OCR quality can also be low. This study develops a novel, deep learning pipeline for extracting full article texts from newspaper images and applies it to the nearly 20 million scans in the Library of Congress's public domain Chronicling America collection. The pipeline includes layout detection, legibility classification, custom OCR, and association of article texts spanning multiple bounding boxes. To achieve high scalability, it is built with efficient architectures designed for mobile phones. The resulting American Stories dataset provides high quality data that could be used for pre-training a large language model to achieve better understanding of historical English and historical world knowledge. The dataset could also be added to the external database of a retrieval-augmented language model to make historical information - ranging from interpretations of political events to minutiae about the lives of people's ancestors - more widely accessible. Furthermore, structured article texts facilitate using transformer-based methods for popular social science applications like topic classification, detection of reproduced content, and news story clustering. Finally, American Stories provides a massive silver quality dataset for innovating multimodal layout analysis models and other multimodal applications.
    Date: 2023–08
  11. By: Zhang, Yuhan; Mekonnen, Shimelse
    Abstract: This working paper systematically analyzes the dynamic commercial relationship between China and Africa. Utilizing Natural Language Processing and content analysis of meticulously collected policy documents, this study finds that the Belt and Road Initiative (BRI) and other proposals by President Xi have shaped the policy direction of China-Africa collaborations, highlighting areas like industrial evolution, infrastructure synergies, agricultural modernization, and sustainable development. By exploring historical economic data, this study also finds that the BRI has significantly influenced Chinese financial commitments to Africa, with investment benefiting 30 distinct African countries, spanning sectors beyond natural resources, and involving both state-owned and private entities. Trade data suggests emerging signs of diversification and reveals China's consistent trade surpluses with Africa, influenced by significant Chinese capital outflows. While Africa's emerging signs of diversification are encouraging, it needs to further diversify into manufacturing and services to avoid mirroring past trade patterns with the West. Our machine learning analysis anticipates China-Africa trade to surpass $300 billion by 2025-6. In light of evolving policies and economic trajectories, this study identifies burgeoning opportunities in sectors like e-commerce, fintech, and agritech, underscoring the immense potential of China-Africa commercial ties. However, it is important to acknowledge that China-Africa commercial cooperation is not without challenges. Disparities in trade balances, concerns about debt sustainability, and local economic impacts have sometimes strained relations, which require continuing attention and further research.
    Keywords: Machine Learning, Economic Policy, China, Africa, Belt and Road Initiative (BRI), Investment, Trade, Commercial Opportunities
    JEL: A1 A12
    Date: 2023
  12. By: S. Srinivas; R. Gadela; R. Sabu; A. Das; G. Nath; V. Datla
    Abstract: Predicting future prices of a stock is an arduous task to perform. However, incorporating additional elements can significantly improve our predictions, rather than relying solely on a stock's historical price data to forecast its future price. Studies have demonstrated that investor sentiment, which is impacted by daily news about the company, can have a significant impact on stock price swings. There are numerous sources from which we can get this information, but they are cluttered with a lot of noise, making it difficult to accurately extract the sentiments from them. Hence the focus of our research is to design an efficient system to capture the sentiments from the news about the NIFTY50 stocks and investigate how much the financial news sentiment of these stocks is affecting their prices over a period of time. This paper presents a robust data collection and preprocessing framework to create a news database for a timeline of around 3.7 years, consisting of almost half a million news articles. We also capture the stock price information for this timeline and create multiple time series that include the sentiment scores from various sections of the article, calculated using different sentiment libraries. Based on this, we fit several LSTM models to forecast the stock prices, with and without using the sentiment scores as features, and compare their performances.
    Date: 2023–08
  13. By: Andres Alonso-Robisco (Banco de España); Jose Manuel Carbo (Banco de España)
    Abstract: Central banks are increasingly using verbal communication for policymaking, focusing not only on traditional monetary policy, but also on a broad set of topics. One such topic is central bank digital currency (CBDC), which is attracting attention from the international community. The complex nature of this project means that it must be carefully designed to avoid unintended consequences, such as financial instability. We propose the use of different Natural Language Processing (NLP) techniques to better understand central banks’ stance towards CBDC, analyzing a set of central bank discourses from 2016 to 2022. We do this using traditional techniques, such as dictionary-based methods, and two large language models (LLMs), namely BERT and ChatGPT, concluding that LLMs better reflect the stance identified by human experts. In particular, we observe that ChatGPT exhibits a higher degree of alignment because it can capture subtler information than BERT. Our study suggests that LLMs are an effective tool to improve sentiment measurements for policy-specific texts, though they are not infallible and may be subject to new risks, like higher sensitivity to the length of texts, and prompt engineering.
    Keywords: ChatGPT, BERT, CBDC, digital money
    JEL: G15 G41 E58
    Date: 2023–08
  14. By: Anuar Assamidanov
    Abstract: Gender discrimination in the hiring process is one significant factor contributing to labor market disparities. However, there is little evidence on the extent to which gender bias by hiring managers is responsible for these disparities. In this paper, I exploit a unique dataset of blind auditions of The Voice television show as an experiment to identify own gender bias in the selection process. In the first televised stage audition, four noteworthy recording artists serving as coaches listen to the contestants blindly (chairs facing away from the stage) to avoid seeing the contestant. Using a difference-in-differences estimation strategy in which the coach (hiring person) is demonstrably exogenous with respect to the artist's gender, I find that artists are 4.5 percentage points (11 percent) more likely to be selected when they are the recipients of an opposite-gender coach. I also utilize the machine-learning approach in Athey et al. (2018) to include heterogeneity from team gender composition, order of performance, and failure rates of the coaches. The findings offer a new perspective to enrich past research on gender discrimination, shedding light on the instances of gender bias variation by the gender of the decision maker and team gender composition.
    Date: 2023–08
  15. By: HyeonJun Kim
    Abstract: This study explores the potential of internet search volume data, specifically Google Trends, as an indicator for cross-sectional stock returns. Unlike previous studies, our research specifically investigates the search volume of the topic 'happiness' and its impact on stock returns from the perspective of risk pricing rather than as a sentiment measurement. Empirical results indicate that this 'happiness' search exposure (HSE) can explain future returns, particularly for big and value firms. This suggests that HSE might be a reflection of a firm's ability to produce goods or services that meet societal utility needs. Our findings have significant implications for institutional investors seeking to leverage HSE-based strategies for outperformance. Additionally, our research suggests that, when selected judiciously, some search topics on Google Trends can be related to risks that impact stock prices.
    Date: 2023–08
  16. By: Andrés Azqueta-Gavaldón (Banco de España); Marina Diakonova (Banco de España); Corinna Ghirelli (Banco de España); Javier J. Pérez (Banco de España)
    Abstract: In this paper, we build a publicly-available database of economic policy uncertainty (EPU) indicators based on the methodology proposed by Azqueta-Gavaldón, Hirschbühl, Onorante and Saiz (2023), which uses topic modelling techniques to identify distinct components of EPU. This database is regularly updated and can be accessed on the Banco de España’s website. Currently, the dataset covers the four largest countries in the euro area, namely Spain, Italy, France, and Germany. Our data coverage is continually expanding to include more euro area countries. Additionally, we compute the aggregated EPU indexes for the euro area. This comprehensive dataset and the resulting euro area indexes provide valuable tools for researchers, policymakers and analysts to assess and monitor the dynamics of economic policy uncertainty in real time.
    Keywords: economic policy uncertainty, euro area, machine learning, Latent Dirichlet Allocation, word embeddings
    JEL: D80 E20 E66 G18
    Date: 2023–07
  17. By: Rick Steinert; Saskia Altmann
    Abstract: This paper investigates the potential improvement of the GPT-4 Large Language Model (LLM) in comparison to BERT for modeling same-day daily stock price movements of Apple and Tesla in 2017, based on sentiment analysis of microblogging messages. We recorded daily adjusted closing prices and translated them into up-down movements. Sentiment for each day was extracted from messages on the Stocktwits platform using both LLMs. We develop a novel method to engineer a comprehensive prompt for contextual sentiment analysis which unlocks the true capabilities of modern LLMs. This enables us to carefully retrieve sentiments, perceived advantages or disadvantages, and the relevance towards the analyzed company. Logistic regression is used to evaluate whether the extracted message contents reflect stock price movements. As a result, GPT-4 exhibited substantial accuracy, outperforming BERT in five out of six months and substantially exceeding a naive buy-and-hold strategy, reaching a peak accuracy of 71.47% in May. The study also highlights the importance of prompt engineering in obtaining desired outputs from GPT-4's contextual abilities. However, the costs of deploying GPT-4 and the need for fine-tuning prompts highlight some practical considerations for its use.
    Date: 2023–08
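The step of translating daily adjusted closing prices into up-down movements, and then scoring predictions by accuracy, might look roughly like the sketch below. The prices and predictions are made up, and treating an unchanged close as a "down" day is an assumption; the paper does not say how ties are handled.

```python
def up_down_labels(closes):
    """1 if the close rose versus the previous day, else 0 (ties count as down)."""
    return [1 if nxt > prev else 0 for prev, nxt in zip(closes, closes[1:])]


def directional_accuracy(predicted, actual):
    """Share of days on which the predicted direction matched the realised one."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(1 for p, a in zip(predicted, actual) if p == a)
    return hits / len(actual)


# Hypothetical adjusted closing prices for five consecutive trading days.
closes = [100.0, 101.5, 101.0, 102.2, 102.2]
actual = up_down_labels(closes)

# Hypothetical sentiment-based direction predictions for the same four moves.
predicted = [1, 0, 0, 0]
acc = directional_accuracy(predicted, actual)
print(actual, acc)
```

A naive benchmark such as always predicting "up" can be scored with the same function, which gives a baseline for judging the reported accuracies.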
  18. By: Tim Matthies; Thomas Löhden; Stephan Leible; Jun-Patrick Raabe
    Abstract: This research investigates the growing trend of retail investors participating in certain stocks by organizing themselves on social media platforms, particularly Reddit. Previous studies have highlighted a notable association between Reddit activity and the volatility of affected stocks. This study seeks to expand the analysis to Twitter, which is among the most impactful social media platforms. To achieve this, we collected relevant tweets and analyzed their sentiment to explore the correlation between Twitter activity, sentiment, and stock volatility. The results reveal a significant relationship between Twitter activity and stock volatility but a weak link between tweet sentiment and stock performance. In general, Twitter activity and sentiment appear to play a less critical role in these events than Reddit activity. These findings offer new theoretical insights into the impact of social media platforms on stock market dynamics, and they may practically assist investors and regulators in comprehending these phenomena better.
    Date: 2023–08
  19. By: Peiheng Gao; Ning Sun; Xuefeng Wang; Chen Yang; Ričardas Zitikis
    Abstract: We develop an NLP-based procedure for detecting systematic nonmeritorious consumer complaints, simply called systematic anomalies, among complaint narratives. While classification algorithms are used to detect pronounced anomalies, in the case of smaller and frequent systematic anomalies, the algorithms may falter due to a variety of reasons, including technical ones as well as natural limitations of human analysts. Therefore, as the next step after classification, we convert the complaint narratives into quantitative data, which are then analyzed using an algorithm for detecting systematic anomalies. We illustrate the entire procedure using complaint narratives from the Consumer Complaint Database of the Consumer Financial Protection Bureau.
    Date: 2023–08

This nep-big issue is ©2023 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at For comments please write to the director of NEP, Marco Novarese at <>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.