nep-big New Economics Papers
on Big Data
Issue of 2024‒10‒21
eighteen papers chosen by
Tom Coupé, University of Canterbury


  1. Bellwether Trades: Characteristics of Trades influential in Predicting Future Price Movements in Markets By Tejas Ramdas; Martin T. Wells
  2. Double machine learning and Stata application By Chen Qiang
  3. Semi-strong Efficient Market of Bitcoin and Twitter: an Analysis of Semantic Vector Spaces of Extracted Keywords and Light Gradient Boosting Machine Models By Fang Wang; Marko Gacesa
  4. Development and validation of a real-time happiness index using Google Trends™ By Greyling, Talita; Rossouw, Stephanié
  5. Predicting Foreign Exchange EUR/USD direction using machine learning By Kevin Cedric Guyard; Michel Deriaz
  6. CB-LMs: language models for central banking By Leonardo Gambacorta; Byeungchun Kwon; Taejin Park; Pietro Patelli; Sonya Zhu
  7. MLP, XGBoost, KAN, TDNN, and LSTM-GRU Hybrid RNN with Attention for SPX and NDX European Call Option Pricing By Boris Ter-Avanesov; Homayoon Beigi
  8. Mining Chinese Historical Sources At Scale: A Machine Learning-Approach to Qing State Capacity By Wolfgang Keller; Carol H. Shiue; Sen Yan
  9. Credit Scores: Performance and Equity By Stefania Albanesi; Domonkos F. Vamossy
  10. Big loans to small businesses: predicting winners and losers in an entrepreneurial lending experiment By Bryan, Gharad; Karlan, Dean; Osman, Adam
  11. Robust financial calibration: a Bayesian approach for neural SDEs By Christa Cuchiero; Eva Flonner; Kevin Kurt
  12. KodeXv0.1: A Family of State-of-the-Art Financial Large Language Models By Neel Rajani; Lilli Kiessling; Aleksandr Ogaltsov; Claus Lang
  13. Anatomy of Machines for Markowitz: Decision-Focused Learning for Mean-Variance Portfolio Optimization By Junhyeong Lee; Inwoo Tae; Yongjae Lee
  14. Robust Reinforcement Learning with Dynamic Distortion Risk Measures By Anthony Coache; Sebastian Jaimungal
  15. Analysis of the expectation dynamics and discourse on the metaverse using X(Twitter) data By Yang, Kyongchyol; Jung, Jaemin
  16. The multimodal emotion information analysis of e-commerce online pricing in electronic word of mouth By Chen, Jinyu; Zhong, Ziqi; Feng, Qindi; Liu, Lei
  17. Analysis of the leading Bitcoin forum with large language models highlights the enduring and substantial carbon footprint of Bitcoin By Cyrille Grumbach; Didier Sornette
  18. Global Stock Market Volatility Forecasting Incorporating Dynamic Graphs and All Trading Days By Zhengyang Chi; Junbin Gao; Chao Wang

  1. By: Tejas Ramdas; Martin T. Wells
    Abstract: In this study, we leverage powerful non-linear machine learning methods to identify the characteristics of trades that contain valuable information. First, we demonstrate the effectiveness of our optimized neural network predictor in accurately predicting future market movements. Then, we utilize the information from this successful neural network predictor to pinpoint the individual trades within each data point (trading window) that had the most impact on the optimized neural network's prediction of future price movements. This approach helps us uncover important insights about the heterogeneity in information content provided by trades of different sizes, venues, trading contexts, and over time.
    Date: 2024–09
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2409.05192
  2. By: Chen Qiang (Shandong University)
    Abstract: Traditional methods for estimating treatment effects generally assume strong functional forms and are only applicable when the covariates are low-dimensional data. However, using machine learning methods directly often leads to "regularization bias". The recently emerging "double/debiased machine learning" provides an effective estimation method without assuming a functional form and is suitable for high-dimensional data. This presentation will introduce the principles of double machine learning in a simple way and demonstrate the corresponding Stata operations with classic cases.
    Date: 2024–10–02
    URL: https://d.repec.org/n?u=RePEc:boc:chin23:03
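    Illustration: The partialling-out idea behind double/debiased machine learning can be sketched in a few lines. This is a minimal sketch, not the presentation's Stata workflow: it is written in Python, the nearest-neighbour learner is a stand-in for whatever machine learning method one would actually use, and the simulated data, function names and parameter values are all hypothetical. The outcome and the treatment are each predicted from the covariate on held-out folds (cross-fitting), and the treatment effect is then estimated by regressing the outcome residuals on the treatment residuals.

```python
import math
import random


def knn_predict(train_x, train_y, x, k=25):
    """Stand-in nuisance learner: mean target of the k nearest training points (1-D covariate)."""
    nearest = sorted(range(len(train_x)), key=lambda j: abs(train_x[j] - x))[:k]
    return sum(train_y[j] for j in nearest) / k


def dml_partialling_out(x, d, y, n_folds=2, k=25):
    """Cross-fitted partialling-out estimate of theta in y = theta * d + g(x) + noise."""
    n = len(x)
    folds = [list(range(f, n, n_folds)) for f in range(n_folds)]
    num = 0.0
    den = 0.0
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(n) if i not in held]
        tx = [x[i] for i in train]
        td = [d[i] for i in train]
        ty = [y[i] for i in train]
        for i in held_out:
            # Residualise treatment and outcome using models fitted on the other fold(s).
            d_res = d[i] - knn_predict(tx, td, x[i], k)
            y_res = y[i] - knn_predict(tx, ty, x[i], k)
            num += d_res * y_res
            den += d_res * d_res
    # No-intercept OLS slope of outcome residuals on treatment residuals.
    return num / den


def simulate(n=1000, theta=0.5, seed=0):
    """Hypothetical data: the covariate affects both treatment and outcome non-linearly."""
    rng = random.Random(seed)
    x = [rng.uniform(-2.0, 2.0) for _ in range(n)]
    d = [math.cos(xi) + rng.gauss(0.0, 1.0) for xi in x]
    y = [theta * di + math.sin(2.0 * xi) + rng.gauss(0.0, 1.0) for xi, di in zip(x, d)]
    return x, d, y


if __name__ == "__main__":
    x, d, y = simulate()
    print("estimated theta:", round(dml_partialling_out(x, d, y), 3))
```

    With the simulated effect set to 0.5, the estimate should land near that value. In practice the nearest-neighbour learner would be replaced by a lasso, random forest or similar method suited to high-dimensional covariates.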
  3. By: Fang Wang (Florence Wong); Marko Gacesa
    Abstract: This study extends the examination of the Efficient-Market Hypothesis (EMH) in the Bitcoin market during a five-year fluctuation period, from September 1 2017 to September 1 2022, by analyzing 28,739,514 qualified tweets containing the targeted topic "Bitcoin". Unlike previous studies, we extracted fundamental keywords as an informative proxy for carrying out the study of the EMH in the Bitcoin market rather than focusing on sentiment analysis, information volume, or price data. We tested market efficiency in hourly, 4-hourly, and daily time periods to understand the speed and accuracy of market reactions towards the information within different thresholds. A sequence of machine learning methods and textual analyses were used, including measurements of distances of semantic vector spaces of information, keywords extraction and encoding model, and Light Gradient Boosting Machine (LGBM) classifiers. Our results suggest that 78.06% (83.08%), 84.63% (87.77%), and 94.03% (94.60%) of hourly, 4-hourly, and daily bullish (bearish) market movements can be attributed to public information within organic tweets.
    Date: 2024–09
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2409.15988
  4. By: Greyling, Talita; Rossouw, Stephanié
    Abstract: It is well-established that a country's economic outcomes, including productivity, future income, and labour market performance, are profoundly influenced by the happiness of its people. Traditionally, survey data have been the primary source for determining people's happiness. However, this approach faces challenges as individuals increasingly experience "survey fatigue"; conducting surveys is costly, data generated from surveys is only available with a significant time lag, and happiness is not a constant state. To address these limitations of survey data, Big Data collected from online sources like Google Trends™ and social media platforms have emerged as a significant and necessary data source to complement traditional survey data. This alternative data source can give policymakers more timely information on people's happiness, well-being or any other issue. In recent years, Google Trends™ data has been leveraged to discern trends in mental health, including depression, anxiety, and loneliness and to construct robust predictors of subjective well-being composite categories. We aim to develop a methodology to construct the first comprehensive, near real-time measure of population-level happiness using information-seeking query data extracted continuously using Google Trends™ in countries. We use a basket of English-language emotion words suggested to capture positive and negative affect based on the literature reviewed. To derive the equation for estimating happiness in a country, we employ machine learning algorithms XGBoost and ElasticNet to determine the most important words and weight the happiness equation, respectively. We use the United Kingdom's ONS (weekly and quarterly) data to demonstrate our methodology. Next, we translate the basket of words into Dutch and apply the same equation to test if the same words and weights can be used in a different country (the Netherlands) to estimate happiness. Lastly, we improve the fit for the Netherlands by incorporating country-specific emotion words. Evaluating the accuracy of our estimated happiness in countries against survey data, we find a very good fit with very low error metrics. If we add country-specific words, we improve the fit statistics. Our suggested methodology shows that emotion words extracted from Google Trends™ can accurately estimate a country's level of happiness.
    Keywords: Happiness, Google Trends™, Big Data, XGBoost, machine learning
    JEL: C53 C55 I31
    Date: 2024
    URL: https://d.repec.org/n?u=RePEc:zbw:glodps:1493
  5. By: Kevin Cedric Guyard; Michel Deriaz
    Abstract: The Foreign Exchange market is a significant market for speculators, characterized by substantial transaction volumes and high volatility. Accurately predicting the directional movement of currency pairs is essential for formulating a sound financial investment strategy. This paper conducts a comparative analysis of various machine learning models for predicting the daily directional movement of the EUR/USD currency pair in the Foreign Exchange market. The analysis includes both decorrelated and non-decorrelated feature sets using Principal Component Analysis. Additionally, this study explores meta-estimators, which involve stacking multiple estimators as input for another estimator, aiming to achieve improved predictive performance. Ultimately, our approach yielded a prediction accuracy of 58.52% for one-day ahead forecasts, coupled with an annual return of 32.48% for the year 2022.
    Date: 2024–09
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2409.04471
  6. By: Leonardo Gambacorta; Byeungchun Kwon; Taejin Park; Pietro Patelli; Sonya Zhu
    Abstract: We introduce central bank language models (CB-LMs) - specialised encoder-only language models retrained on a comprehensive corpus of central bank speeches, policy documents and research papers. We show that CB-LMs outperform their foundational models in predicting masked words in central bank idioms. Some CB-LMs not only outperform their foundational models, but also surpass state-of-the-art generative Large Language Models (LLMs) in classifying monetary policy stance from Federal Open Market Committee (FOMC) statements. In more complex scenarios, requiring sentiment classification of extensive news related to the US monetary policy, we find that the largest LLMs outperform the domain-adapted encoder-only models. However, deploying such large LLMs presents substantial challenges for central banks in terms of confidentiality, transparency, replicability and cost-efficiency.
    Keywords: large language models, gen AI, central banks, monetary policy analysis
    JEL: E58 C55 C63 G17
    URL: https://d.repec.org/n?u=RePEc:bis:biswps:1215
  7. By: Boris Ter-Avanesov; Homayoon Beigi
    Abstract: We explore the performance of various artificial neural network architectures, including a multilayer perceptron (MLP), Kolmogorov-Arnold network (KAN), LSTM-GRU hybrid recursive neural network (RNN) models, and a time-delay neural network (TDNN) for pricing European call options. In this study, we attempt to leverage the ability of supervised learning methods, such as ANNs, KANs, and gradient-boosted decision trees, to approximate complex multivariate functions in order to calibrate option prices based on past market data. The motivation for using ANNs and KANs is the Universal Approximation Theorem and Kolmogorov-Arnold Representation Theorem, respectively. Specifically, we use S&P 500 (SPX) and NASDAQ 100 (NDX) index options traded during 2015-2023 with times to maturity ranging from 15 days to over 4 years (OptionMetrics IvyDB US dataset). The performance of the Black-Scholes (BS) PDE model (Black and Scholes, 1973) in pricing the same options, compared to real data, is used as a benchmark. This model relies on strong assumptions, and it has been observed and discussed in the literature that real data does not match its predictions. Supervised learning methods are widely used as an alternative for calibrating option prices due to some of the limitations of this model. In our experiments, the BS model underperforms compared to all of the others. Also, the best TDNN model outperforms the best MLP model on all error metrics. We implement a simple self-attention mechanism to enhance the RNN models, significantly improving their performance. The best-performing model overall is the LSTM-GRU hybrid RNN model with attention. Also, the KAN model outperforms the TDNN and MLP models. We analyze the performance of all models by ticker, moneyness category, and over/under/correctly-priced percentage.
    Date: 2024–08
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2409.06724
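    Illustration: For readers unfamiliar with the benchmark, the closed-form Black-Scholes price of a European call can be sketched as follows. This is a minimal sketch assuming no dividends and a constant rate and volatility; the paper's exact benchmark setup (dividend treatment, choice of volatility input) is not specified in the abstract, and the function and parameter names are hypothetical.

```python
from math import erf, exp, log, sqrt


def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def bs_call(spot, strike, rate, vol, maturity):
    """Black-Scholes price of a European call (no dividends; maturity in years)."""
    d1 = (log(spot / strike) + (rate + 0.5 * vol ** 2) * maturity) / (vol * sqrt(maturity))
    d2 = d1 - vol * sqrt(maturity)
    return spot * norm_cdf(d1) - strike * exp(-rate * maturity) * norm_cdf(d2)


if __name__ == "__main__":
    # At-the-money call, one year to maturity, 5% rate, 20% volatility.
    print(round(bs_call(100.0, 100.0, 0.05, 0.20, 1.0), 4))
```

    A learned pricer would be compared against this formula by computing both models' errors relative to observed market prices.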
  8. By: Wolfgang Keller; Carol H. Shiue; Sen Yan
    Abstract: Primary historical sources are often by-passed for secondary sources due to the high human costs of accessing and extracting primary information, especially in lower-resource settings. We propose a supervised machine-learning approach to the natural language processing of Chinese historical data. An application to identifying different forms of social unrest in the Veritable Records of the Qing Dynasty shows that the approach dramatically cuts the cost of using primary source data while being free from human bias, reproducible, and flexible enough to address particular questions. External evidence on triggers of unrest also suggests that the computer-based approach is no less successful in identifying social unrest than human researchers are.
    JEL: C8 N45
    Date: 2024–09
    URL: https://d.repec.org/n?u=RePEc:nbr:nberwo:32982
  9. By: Stefania Albanesi; Domonkos F. Vamossy
    Abstract: Credit scores are critical for allocating consumer debt in the United States, yet little evidence is available on their performance. We benchmark a widely used credit score against a machine learning model of consumer default and find significant misclassification of borrowers, especially those with low scores. Our model improves predictive accuracy for young, low-income, and minority groups due to its superior performance with low quality data, resulting in a gain in standing for these populations. Our findings suggest that improving credit scoring performance could lead to more equitable access to credit.
    Date: 2024–08
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2409.00296
  10. By: Bryan, Gharad; Karlan, Dean; Osman, Adam
    Abstract: We experimentally study the impact of relatively large enterprise loans in Egypt. Larger loans generate small average impacts, but machine learning using psychometric data reveals "top performers" (those with the highest predicted treatment effects) substantially increase profits, while profits drop for poor performers. The large differences imply that lender credit allocation decisions matter for aggregate income, yet we find existing practice leads to substantial misallocation. We argue that some entrepreneurs are overoptimistic and squander the opportunities presented by larger loans by taking on too much risk, and show the promise of allocations based on entrepreneurial type relative to firm characteristics.
    Keywords: entrepreneurship; enterprise credit; heterogenous treatment effects; psychometric data; small and medium enterprises
    JEL: D24 M21 O12 O16
    Date: 2024–09–20
    URL: https://d.repec.org/n?u=RePEc:ehl:lserod:120637
  11. By: Christa Cuchiero; Eva Flonner; Kevin Kurt
    Abstract: The paper presents a Bayesian framework for the calibration of financial models using neural stochastic differential equations (neural SDEs). The method is based on the specification of a prior distribution on the neural network weights and an adequately chosen likelihood function. The resulting posterior distribution can be seen as a mixture of different classical neural SDE models yielding robust bounds on the implied volatility surface. Both historical financial time series data and option price data are taken into consideration, which necessitates a methodology to learn the change of measure between the risk-neutral and the historical measure. The key ingredient for a robust numerical optimization of the neural networks is to apply a Langevin-type algorithm, commonly used in Bayesian approaches to draw posterior samples.
    Date: 2024–09
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2409.06551
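    Illustration: A Langevin-type sampler of the kind mentioned in the abstract can be sketched on a toy problem. This is a minimal sketch of the unadjusted Langevin algorithm for a single scalar "weight" with a standard normal prior and a Gaussian likelihood. It is not the paper's algorithm or model, and the data, step size and function names are hypothetical. Each update moves along the gradient of the log-posterior and adds Gaussian noise, so the iterates approximately sample the posterior rather than converging to a single point.

```python
import math
import random

# Hypothetical observations, modelled as y_i ~ N(theta, 1) with prior theta ~ N(0, 1).
DATA = [1.8, 2.1, 2.4, 1.7]


def grad_log_posterior(theta):
    """Gradient of log prior plus log likelihood for the toy Gaussian model."""
    return -theta + sum(y - theta for y in DATA)


def ula_samples(grad, theta0, step, n_steps, seed=0):
    """Unadjusted Langevin algorithm: theta <- theta + (step/2) * grad + sqrt(step) * noise."""
    rng = random.Random(seed)
    theta = theta0
    samples = []
    for _ in range(n_steps):
        theta = theta + 0.5 * step * grad(theta) + math.sqrt(step) * rng.gauss(0.0, 1.0)
        samples.append(theta)
    return samples


if __name__ == "__main__":
    draws = ula_samples(grad_log_posterior, theta0=0.0, step=0.02, n_steps=20000)
    kept = draws[2000:]  # discard burn-in
    print("posterior mean estimate:", round(sum(kept) / len(kept), 3))
```

    For this conjugate toy model the exact posterior mean is sum(DATA) / (len(DATA) + 1) = 1.6, which gives a simple check on the sampler.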
  12. By: Neel Rajani; Lilli Kiessling; Aleksandr Ogaltsov; Claus Lang
    Abstract: Although powerful, current cutting-edge LLMs may not fulfil the needs of highly specialised sectors. We introduce KodeXv0.1, a family of large language models that outclass GPT-4 in financial question answering. We utilise the base variants of Llama 3.1 8B and 70B and adapt them to the financial domain through a custom training regime. To this end, we collect and process a large number of publicly available financial documents such as earnings calls and business reports. These are used to generate a high-quality, synthetic dataset consisting of Context-Question-Answer triplets which closely mirror real-world financial tasks. Using the train split of this dataset, we perform RAG-aware 4bit LoRA instruction tuning runs of Llama 3.1 base variants to produce KodeX-8Bv0.1 and KodeX-70Bv0.1. We then complete extensive model evaluations using FinanceBench, FinQABench and the withheld test split of our dataset. Our results show that KodeX-8Bv0.1 is more reliable in financial contexts than cutting-edge instruct models in the same parameter regime, surpassing them by up to 9.24%. In addition, it is even capable of outperforming state-of-the-art proprietary models such as GPT-4 by up to 7.07%. KodeX-70Bv0.1 represents a further improvement upon this, exceeding GPT-4's performance on every tested benchmark.
    Date: 2024–09
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2409.13749
  13. By: Junhyeong Lee; Inwoo Tae; Yongjae Lee
    Abstract: Markowitz laid the foundation of portfolio theory through the mean-variance optimization (MVO) framework. However, the effectiveness of MVO is contingent on the precise estimation of expected returns, variances, and covariances of asset returns, which are typically uncertain. Machine learning models are becoming useful in estimating uncertain parameters, and such models are trained to minimize prediction errors, such as mean squared errors (MSE), which treat prediction errors uniformly across assets. Recent studies have pointed out that this approach would lead to suboptimal decisions and proposed Decision-Focused Learning (DFL) as a solution, integrating prediction and optimization to improve decision-making outcomes. While studies have shown DFL's potential to enhance portfolio performance, the detailed mechanisms of how DFL modifies prediction models for MVO remain unexplored. This study aims to investigate how DFL adjusts stock return prediction models to optimize decisions in MVO, addressing the question: "MSE treats the errors of all assets equally, but how does DFL reduce errors of different assets differently?" Answering this will provide crucial insights into optimal stock return prediction for constructing efficient portfolios.
    Date: 2024–09
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2409.09684
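    Illustration: The abstract's question, why equal prediction errors need not matter equally for the portfolio decision, can be made concrete in a simplified setting. This is a minimal sketch assuming an unconstrained mean-variance investor and uncorrelated assets, which is not the paper's setup; all names and numbers are hypothetical. Under these assumptions the optimal weight on asset i is mu_i / (gamma * var_i), and the utility lost by acting on a prediction that is off by e on asset i is e^2 / (2 * gamma * var_i). The same squared error therefore costs more on a low-variance asset.

```python
def mvo_weights(mu, var, gamma):
    """Unconstrained mean-variance weights for uncorrelated assets."""
    return [m / (gamma * v) for m, v in zip(mu, var)]


def utility(weights, mu, var, gamma):
    """Mean-variance utility: expected return minus (gamma / 2) times portfolio variance."""
    expected_return = sum(w * m for w, m in zip(weights, mu))
    variance = sum(w * w * v for w, v in zip(weights, var))
    return expected_return - 0.5 * gamma * variance


def decision_regret(mu_true, mu_pred, var, gamma):
    """Utility lost (under the true means) by optimising against predicted means."""
    best = utility(mvo_weights(mu_true, var, gamma), mu_true, var, gamma)
    achieved = utility(mvo_weights(mu_pred, var, gamma), mu_true, var, gamma)
    return best - achieved


if __name__ == "__main__":
    mu_true = [0.05, 0.05]
    var = [0.01, 0.04]  # asset 0 is the low-variance asset
    gamma = 2.0
    # The same prediction error (0.02) applied to each asset in turn.
    print(decision_regret(mu_true, [0.07, 0.05], var, gamma))
    print(decision_regret(mu_true, [0.05, 0.07], var, gamma))
```

    Both predictions have identical mean squared error, yet the first loses four times as much utility as the second, because asset 0 has a quarter of the variance.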
  14. By: Anthony Coache; Sebastian Jaimungal
    Abstract: In a reinforcement learning (RL) setting, the agent's optimal strategy heavily depends on her risk preferences and the underlying model dynamics of the training environment. These two aspects influence the agent's ability to make well-informed and time-consistent decisions when facing testing environments. In this work, we devise a framework to solve robust risk-aware RL problems where we simultaneously account for environmental uncertainty and risk with a class of dynamic robust distortion risk measures. Robustness is introduced by considering all models within a Wasserstein ball around a reference model. We estimate such dynamic robust risk measures using neural networks by making use of strictly consistent scoring functions, derive policy gradient formulae using the quantile representation of distortion risk measures, and construct an actor-critic algorithm to solve this class of robust risk-aware RL problems. We demonstrate the performance of our algorithm on a portfolio allocation example.
    Date: 2024–09
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2409.10096
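    Illustration: The quantile representation of a distortion risk measure can be sketched with an empirical estimator. This is a minimal static sketch under one common convention (losses sorted in ascending order and weighted by increments of a distortion function on [0, 1]). It is not the paper's dynamic or robust construction, sign and tail conventions vary across the literature, and the function names are hypothetical. Conditional value-at-risk (CVaR) is recovered as a special case.

```python
def distortion_risk(losses, distortion):
    """Empirical distortion risk: sorted losses weighted by increments of the distortion function."""
    xs = sorted(losses)
    n = len(xs)
    return sum(
        x * (distortion(i / n) - distortion((i - 1) / n))
        for i, x in enumerate(xs, start=1)
    )


def cvar_distortion(alpha):
    """Distortion function putting uniform weight on the worst (1 - alpha) share of losses."""
    return lambda u: max(0.0, (u - alpha) / (1.0 - alpha))


def identity(u):
    """Identity distortion: reproduces the plain sample mean."""
    return u


if __name__ == "__main__":
    losses = [float(v) for v in range(1, 11)]
    print("mean:", distortion_risk(losses, identity))
    print("CVaR at 80%:", distortion_risk(losses, cvar_distortion(0.8)))
```

    With losses 1 through 10, the 80% CVaR averages the two largest losses. Other distortion functions reweight the quantiles differently, which is what makes the family a flexible description of risk preferences.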
  15. By: Yang, Kyongchyol; Jung, Jaemin
    Abstract: This study aims to empirically investigate the role of expectations in innovation diffusion using metaverse discourse on social media. We also explore the dominant discourses that emerged during significant periods when expectations shifted. Expectations play a crucial role in the hype cycle and are involved in determining the direction of technological change and the speed of adopting innovations. We analyzed tweet data containing metaverse from January 2021, the year the metaverse first gained public attention, to May 2023, the latest available date. We performed topic modeling and extracted 218 topics. Results revealed that the peak number of tweets in the metaverse was concentrated in specific periods, but the timing of the decline in the number of tweets shows different patterns depending on the topic. After categorizing the extracted topics into themes, we conducted sentiment analysis to observe expectation changes. Expectations of metaverse technology changed over time: initially, positive emotional expectations were high, but after a certain point, they declined. From the discourse analysis, we found that the Metaverse was initially hyped and faced technical and practical limitations. The decline in expectations reflects these limitations. The combination of topic modeling and sentiment analysis provides a new approach to investigating public discourse and a comprehensive view of the collective flow around new technologies.
    Keywords: Metaverse, Innovation diffusion, Innovation expectation, Hype cycle
    Date: 2024
    URL: https://d.repec.org/n?u=RePEc:zbw:itsb24:302496
  16. By: Chen, Jinyu; Zhong, Ziqi; Feng, Qindi; Liu, Lei
    Abstract: E-commerce has developed rapidly, and product promotion refers to how e-commerce promotes consumers' consumption activities. The demand and computational complexity in the decision-making process are urgent problems to be solved to optimize dynamic pricing decisions of the e-commerce product lines. Therefore, a Q-learning algorithm model based on the neural network is proposed on the premise of multimodal emotion information recognition and analysis, and the dynamic pricing problem of the product line is studied. The results show that a multi-modal fusion model is established through the multi-modal fusion of speech emotion recognition and image emotion recognition to classify consumers' emotions. Then, they are used as auxiliary materials for understanding and analyzing the market demand. The long short-term memory (LSTM) classifier performs excellent image feature extraction. The accuracy rate is 3.92%-6.74% higher than that of other similar classifiers, and the accuracy rate of the image single-feature optimal model is 9.32% higher than that of the speech single-feature model.
    Keywords: dynamic pricing; E-commerce; emotion recognition; neural network; Q-learning algorithm
    JEL: L81 J50
    Date: 2022–12–31
    URL: https://d.repec.org/n?u=RePEc:ehl:lserod:125409
  17. By: Cyrille Grumbach (ETH Zürich); Didier Sornette (Risks-X, Southern University of Science and Technology (SUSTech); Swiss Finance Institute)
    Abstract: Bitcoin's substantial carbon footprint is widely acknowledged, though debates persist regarding its true scale. In this study, we present a novel methodology to quantify Bitcoin's carbon footprint, demonstrating a dramatic increase from 0.02 MtCO2e in 2011 to 89 MtCO2e in 2023. By leveraging large language models to analyze Bitcoin Forum data, we accurately identify miners' hardware configurations, addressing the limitations of prior research that lacked empirical data. Our findings also highlight that Bitcoin mining is approaching cost-price parity, positioning it as a potentially enduring financial instrument.
    Keywords: Bitcoin, blockchain technology, Carbon footprint, Cryptocurrency mining, mining hardware
    JEL: C19 C80 Q01 Q56
    Date: 2024–09
    URL: https://d.repec.org/n?u=RePEc:chf:rpseri:rp2451
  18. By: Zhengyang Chi; Junbin Gao; Chao Wang
    Abstract: This study introduces a global stock market volatility forecasting model that enhances forecasting accuracy and practical utility in real-world financial decision-making by integrating dynamic graph structures and encompassing the union of active trading days of different stock markets. The model employs a spatial-temporal graph neural network (GNN) architecture to capture the volatility spillover effect, where shocks in one market spread to others through the interconnective global economy. By calculating the volatility spillover index to depict the volatility network as graphs, the model effectively mirrors the volatility dynamics for the chosen stock market indices. In the empirical analysis, the proposed model surpasses the benchmark model in all forecasting scenarios and is shown to be sensitive to the underlying volatility interrelationships.
    Date: 2024–09
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2409.15320

This nep-big issue is ©2024 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at https://nep.repec.org. For comments please write to the director of NEP, Marco Novarese at <director@nep.repec.org>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.