nep-big 2025-08-25 papers

on Big Data

Issue of 2025–08–25
seventeen papers chosen by
Tom Coupé, University of Canterbury

Benchmarking Classical and Quantum Models for DeFi Yield Prediction on Curve Finance By Chi-Sheng Chen; Aidan Hung-Wen Tsai
Neural Network-Based Algorithmic Trading Systems: Multi-Timeframe Analysis and High-Frequency Execution in Cryptocurrency Markets By Wěi Zhāng
Causality analysis of electricity market liberalization on electricity price using novel Machine Learning methods By Orr Shahar; Stefan Lessmann; Daniel Traian Pele
Interpretable machine learning for earnings forecasts: Leveraging high-dimensional financial statement data By Hess, Dieter; Simon, Frederik; Weibels, Sebastian
Introducing a New Brexit-Related Uncertainty Index: Its Evolution and Economic Consequences By Ismet Gocer; Julia Darby; Serdar Ongan
Investment Portfolio Optimization Based on Modern Portfolio Theory and Deep Learning Models By Maciej Wysocki; Paweł Sakowski
Field-Scale Rice Area and Yield Mapping in Sri Lanka with Optical Remote Sensing and Limited Training Data By Özdogan, Mutlu; Wang, Sherrie; Ghose, Devaki; Fraga, Eduardo; Fernandes, Ana Margarida; Varela, Gonzalo J.
Technical Indicator Networks (TINs): An Interpretable Neural Architecture Modernizing Classical Technical Analysis for Adaptive Algorithmic Trading By Longfei Lu
CreditARF: A Framework for Corporate Credit Rating with Annual Report and Financial Feature Integration By Yumeng Shi; Zhongliang Yang; DiYang Lu; Yisi Wang; Yiting Zhou; Linna Zhou
Language Model Guided Reinforcement Learning in Quantitative Trading By Adam Darmanin; Vince Vella
Comparing Normalization Methods for Portfolio Optimization with Reinforcement Learning By Caio de Souza Barbosa Costa; Anna Helena Reali Costa
Kronos: A Foundation Model for the Language of Financial Markets By Yu Shi; Zongliang Fu; Shuo Chen; Bohan Zhao; Wei Xu; Changshui Zhang; Jian Li
Efficient and Scalable Estimation of Distributional Treatment Effects with Multi-Task Neural Networks By Tomu Hirata; Undral Byambadalai; Tatsushi Oka; Shota Yasui; Shingo Uto
Blind Targeting: Personalization under Third-Party Privacy Constraints By Anya Shchetkina
Federal Reserve Communication and the COVID-19 Pandemic By Jonathan Benchimol; Sophia Kazinnik; Yossi Saadon
SHAP Stability in Credit Risk Management: A Case Study in Credit Card Default Model By Luyun Lin; Yiqing Wang
Forecasting Climate Policy Uncertainty: Evidence from the United States By Donia Besher; Anirban Sengupta; Tanujit Chakraborty

Benchmarking Classical and Quantum Models for DeFi Yield Prediction on Curve Finance

By:	Chi-Sheng Chen; Aidan Hung-Wen Tsai
Abstract:	The rise of decentralized finance (DeFi) has created a growing demand for accurate yield and performance forecasting to guide liquidity allocation strategies. In this study, we benchmark six models, XGBoost, Random Forest, LSTM, Transformer, quantum neural networks (QNN), and quantum support vector machines with quantum feature maps (QSVM-QNN), on one year of historical data from 28 Curve Finance pools. We evaluate model performance on test MAE, RMSE, and directional accuracy. Our results show that classical ensemble models, particularly XGBoost and Random Forest, consistently outperform both deep learning and quantum models. XGBoost achieves the highest directional accuracy (71.57%) with a test MAE of 1.80, while Random Forest attains the lowest test MAE of 1.77 and 71.36% accuracy. In contrast, quantum models underperform with directional accuracy below 50% and higher errors, highlighting current limitations in applying quantum machine learning to real-world DeFi time series data. This work offers a reproducible benchmark and practical insights into model suitability for DeFi applications, emphasizing the robustness of classical methods over emerging quantum approaches in this domain.
Date:	2025–07
URL:	https://d.repec.org/n?u=RePEc:arx:papers:2508.02685
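
The three evaluation metrics named in the abstract can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the abstract does not define directional accuracy, so the definition used here (the predicted move and the actual move, both measured from the previous actual value, share a sign) is an assumption, and the sample numbers are hypothetical.

```python
import math


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


def directional_accuracy(actual, predicted):
    """Share of steps where predicted and actual moves have the same sign.

    Assumed definition: both moves are measured from the previous actual value.
    """
    hits = 0
    for t in range(1, len(actual)):
        actual_move = actual[t] - actual[t - 1]
        predicted_move = predicted[t] - actual[t - 1]
        if actual_move * predicted_move > 0:
            hits += 1
    return hits / (len(actual) - 1)


# Hypothetical yield series, for illustration only.
actual_yield = [2.0, 2.5, 2.4, 3.0]
predicted_yield = [2.1, 2.3, 2.6, 2.9]

print(mae(actual_yield, predicted_yield))
print(rmse(actual_yield, predicted_yield))
print(directional_accuracy(actual_yield, predicted_yield))
```

A model can score well on MAE and still miss the direction of most moves, which is why the two kinds of metric are reported side by side.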

Neural Network-Based Algorithmic Trading Systems: Multi-Timeframe Analysis and High-Frequency Execution in Cryptocurrency Markets

By:	Wěi Zhāng
Abstract:	This paper explores neural network-based approaches for algorithmic trading in cryptocurrency markets. Our approach combines multi-timeframe trend analysis with high-frequency direction prediction networks, achieving positive risk-adjusted returns through statistical modeling and systematic market exploitation. The system integrates diverse data sources including market data, on-chain metrics, and orderbook dynamics, translating these into unified buy/sell pressure signals. We demonstrate how machine learning models can effectively capture cross-timeframe relationships, enabling sub-second trading decisions with statistical confidence.
Date:	2025–08
URL:	https://d.repec.org/n?u=RePEc:arx:papers:2508.02356

Causality analysis of electricity market liberalization on electricity price using novel Machine Learning methods

By:	Orr Shahar; Stefan Lessmann; Daniel Traian Pele
Abstract:	Relationships between the energy and the finance markets are increasingly important. Understanding these relationships is vital for policymakers and other stakeholders as the world faces challenges such as satisfying humanity's increasing need for energy and the effects of climate change. In this paper, we investigate the causal effect of electricity market liberalization on the electricity price in the US. By performing this analysis, we aim to provide new insights into the ongoing debate about the benefits of electricity market liberalization. We introduce Causal Machine Learning as a new approach for interventions in the energy-finance field. The development of machine learning in recent years opened the door for a new branch of machine learning models for causality impact, with the ability to extract complex patterns and relationships from the data. We discuss the advantages of causal ML methods and compare the performance of ML-based models to shed light on the applicability of causal ML frameworks to energy policy intervention cases. We find that the DeepProbCP framework outperforms the other frameworks examined. In addition, we find that liberalization of, and individual players' entry to, the electricity market resulted in a 7% decrease in price in the short term.
Date:	2025–07
URL:	https://d.repec.org/n?u=RePEc:arx:papers:2507.12331

Interpretable machine learning for earnings forecasts: Leveraging high-dimensional financial statement data

By:	Hess, Dieter; Simon, Frederik; Weibels, Sebastian
Abstract:	We predict earnings for forecast horizons of up to five years by using the entire set of Compustat financial statement data as input and providing it to state-of-the-art machine learning models capable of approximating arbitrary functional forms. Our approach improves prediction one year ahead by an average of 11% compared to the traditional linear approach that performs best. This superior performance is consistent across a variety of evaluation metrics as well as different firm subsamples and translates into more profitable investment strategies. Extensive model interpretation reveals that income statement variables, especially different definitions of earnings, are by far the most important predictors. Conversely, we find that while income statement variables decline in relevance, balance sheet information becomes more significant as the forecast horizon extends. Lastly, we show that the influence of interactions and non-linearities on the machine learning forecast is modest, but substantial differences between firm subsamples exist.
Keywords:	Earnings Forecasts, Cross-Sectional Earnings Models, Machine Learning
JEL:	G11 G12 G17 G31 G32 M40 M41
Date:	2025
URL:	https://d.repec.org/n?u=RePEc:zbw:cfrwps:323935

Introducing a New Brexit-Related Uncertainty Index: Its Evolution and Economic Consequences

By:	Ismet Gocer; Julia Darby; Serdar Ongan
Abstract:	Important game-changing economic events and transformations cause uncertainties that may affect investment decisions, capital flows, international trade, and macroeconomic variables. One such major transformation is Brexit, which refers to the multiyear process through which the UK withdrew from the EU. This study develops and uses a new Brexit-Related Uncertainty Index (BRUI). In creating this index, we apply Text Mining, Context Window, Natural Language Processing (NLP), and Large Language Models (LLMs) from Deep Learning techniques to analyse the monthly country reports of the Economist Intelligence Unit from May 2012 to January 2025. Additionally, we employ a standard vector autoregression (VAR) analysis to examine the model-implied responses of various macroeconomic variables to BRUI shocks. While developing the BRUI, we also create a complementary COVID-19 Related Uncertainty Index (CRUI) to distinguish the uncertainties stemming from these distinct events. Empirical findings and comparisons of BRUI with other earlier-developed uncertainty indexes demonstrate the robustness of the new index. This new index can assist British policymakers in measuring and understanding the impacts of Brexit-related uncertainties, enabling more effective policy formulation.
Date:	2025–07
URL:	https://d.repec.org/n?u=RePEc:arx:papers:2507.02439

Investment Portfolio Optimization Based on Modern Portfolio Theory and Deep Learning Models

By:	Maciej Wysocki; Paweł Sakowski
Abstract:	This paper investigates the important problem of estimating an appropriate variance-covariance matrix in Modern Portfolio Theory. We propose a novel framework for variance-covariance matrix estimation for portfolio optimization, which is based on deep learning models. We employ long short-term memory (LSTM) recurrent neural networks (RNN) along with two probabilistic deep learning models, DeepVAR and GPVAR, for the task of one-day-ahead multivariate forecasting. We then use these forecasts to optimize portfolios of stocks and cryptocurrencies. Our analysis presents results across different combinations of observation windows and rebalancing periods to compare the performance of classical and deep learning variance-covariance estimation methods. We conclude that, although the performance of the strategies (portfolios) differed significantly between parameter combinations, the best results in terms of the information ratio and annualized returns are generally obtained using the LSTM-RNN models. Moreover, longer observation windows translate into better performance of the deep learning models, indicating that these methods require longer windows to efficiently capture the long-term dependencies of the variance-covariance matrix structure. Strategies with less frequent rebalancing typically perform better than those with the shortest rebalancing windows across all considered methods.
Date:	2025–08
URL:	https://d.repec.org/n?u=RePEc:arx:papers:2508.14999
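
To show why the variance-covariance estimate matters for Modern Portfolio Theory, the sketch below computes global minimum-variance weights for two assets in closed form. It is an illustration under stated assumptions, not the paper's optimizer: the function names and the input numbers are hypothetical, weights are not constrained to be long-only, and in the paper's setting the variance and covariance inputs would come from the forecasting models.

```python
def min_variance_weights(var_a, var_b, cov_ab):
    """Global minimum-variance weights for two assets (weights sum to 1)."""
    denom = var_a + var_b - 2.0 * cov_ab
    if denom <= 0:
        raise ValueError("degenerate covariance inputs")
    w_a = (var_b - cov_ab) / denom
    return w_a, 1.0 - w_a


def portfolio_variance(w_a, w_b, var_a, var_b, cov_ab):
    """Variance of a two-asset portfolio with the given weights."""
    return w_a ** 2 * var_a + w_b ** 2 * var_b + 2.0 * w_a * w_b * cov_ab


# Hypothetical one-day-ahead forecasts of variances and covariance.
var_a, var_b, cov_ab = 0.04, 0.09, 0.0
w_a, w_b = min_variance_weights(var_a, var_b, cov_ab)
print(w_a, w_b, portfolio_variance(w_a, w_b, var_a, var_b, cov_ab))
```

Because the weights depend only on the variance and covariance inputs, any error in the estimated matrix passes straight through to the allocation.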

Field-Scale Rice Area and Yield Mapping in Sri Lanka with Optical Remote Sensing and Limited Training Data

By:	Özdogan, Mutlu; Wang, Sherrie; Ghose, Devaki; Fraga, Eduardo; Fernandes, Ana Margarida; Varela, Gonzalo J.
Abstract:	Rice is a staple crop for over half the world’s population, and accurate, timely information on its planted area and production is crucial for food security and agricultural policy, particularly in developing nations like Sri Lanka. However, reliable rice monitoring in regions like Sri Lanka faces significant challenges due to frequent cloud cover and the fragmented nature of small-holder farms. This research introduces a novel, cost-effective method for mapping rice planted area and yield at field scales in Sri Lanka using optical satellite data. The rice planted fields were identified and mapped using a phenologically-tuned image classification algorithm that highlights rice presence by observing water occurrence during transplanting and vegetation activity during subsequent crop growth. To estimate yields, a random forest regression model was trained at the district level by incorporating a satellite-derived chlorophyll index and environmental variables and subsequently applied at the field level. The approach has enabled the creation of two decades (2000–2022) of reliable, field-scale rice area and yield estimates, achieving map accuracies between 70% and over 90% and yield estimations with less than 20% RMSE. These highly granular results, which were previously unattainable through traditional surveys, show strong correlation with government statistics. They also demonstrate the advantages of a rule-based, phenology-driven classification over purely statistical machine learning models for long-term consistency in dynamic agricultural environments. This work highlights the significant potential of remote sensing to provide accurate and detailed insights into rice cultivation, supporting policy decisions and enhancing food security in Sri Lanka and other cloud-prone regions.
Date:	2025–08–20
URL:	https://d.repec.org/n?u=RePEc:wbk:wbrwps:11194

Technical Indicator Networks (TINs): An Interpretable Neural Architecture Modernizing Classical Technical Analysis for Adaptive Algorithmic Trading

By:	Longfei Lu
Abstract:	This work proposes that a vast majority of classical technical indicators in financial analysis are, in essence, special cases of neural networks with fixed and interpretable weights. It is shown that nearly all such indicators, such as moving averages, momentum-based oscillators, volatility bands, and other commonly used technical constructs, can be reconstructed topologically as modular neural network components. Technical Indicator Networks (TINs) are introduced as a general neural architecture that replicates and structurally upgrades traditional indicators by supporting n-dimensional inputs such as price, volume, sentiment, and order book data. By encoding domain-specific knowledge into neural structures, TINs modernize the foundational logic of technical analysis and propel algorithmic trading into a new era, bridging the legacy of proven indicators with the potential of contemporary AI systems.
Date:	2025–07
URL:	https://d.repec.org/n?u=RePEc:arx:papers:2507.20202
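
The claim that classical indicators are special cases of neural networks with fixed weights can be illustrated with two simple cases. The sketch below is not the TIN architecture from the paper. It only shows, with hypothetical helper names, that a simple moving average and a momentum indicator are each a single linear neuron (weighted sum, zero bias, identity activation) slid over a price window.

```python
def linear_neuron(inputs, weights, bias=0.0):
    """One neuron with identity activation: a weighted sum plus a bias."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias


def sma_layer(prices, window):
    """Simple moving average: every weight is fixed at 1/window."""
    weights = [1.0 / window] * window
    return [
        linear_neuron(prices[i - window + 1 : i + 1], weights)
        for i in range(window - 1, len(prices))
    ]


def momentum_layer(prices, lag):
    """Momentum (price minus price `lag` steps ago): weights [-1, 0, ..., 0, +1]."""
    weights = [-1.0] + [0.0] * (lag - 1) + [1.0]
    return [
        linear_neuron(prices[i - lag : i + 1], weights)
        for i in range(lag, len(prices))
    ]


prices = [10.0, 11.0, 12.0, 13.0, 14.0]  # hypothetical closing prices
print(sma_layer(prices, 3))
print(momentum_layer(prices, 2))
```

Making the weights trainable instead of fixed would turn each indicator into a learnable layer initialized at its classical form, which is one way to read the "structural upgrade" described in the abstract.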

CreditARF: A Framework for Corporate Credit Rating with Annual Report and Financial Feature Integration

By:	Yumeng Shi; Zhongliang Yang; DiYang Lu; Yisi Wang; Yiting Zhou; Linna Zhou
Abstract:	Corporate credit rating serves as a crucial intermediary service in the market economy, playing a key role in maintaining economic order. Existing credit rating models rely on financial metrics and deep learning. However, they often overlook insights from non-financial data, such as corporate annual reports. To address this, this paper introduces a corporate credit rating framework that integrates financial data with features extracted from annual reports using FinBERT, aiming to fully leverage the potential value of unstructured text data. In addition, we have developed a large-scale dataset, the Comprehensive Corporate Rating Dataset (CCRD), which combines both traditional financial data and textual data from annual reports. The experimental results show that the proposed method improves the accuracy of the rating predictions by 8-12%, significantly improving the effectiveness and reliability of corporate credit ratings.
Date:	2025–08
URL:	https://d.repec.org/n?u=RePEc:arx:papers:2508.02738

Language Model Guided Reinforcement Learning in Quantitative Trading

By:	Adam Darmanin; Vince Vella
Abstract:	Algorithmic trading requires short-term decisions aligned with long-term financial goals. While reinforcement learning (RL) has been explored for such tactical decisions, its adoption remains limited by myopic behavior and opaque policy rationale. In contrast, large language models (LLMs) have recently demonstrated strategic reasoning and multi-modal financial signal interpretation when guided by well-designed prompts. We propose a hybrid system where LLMs generate high-level trading strategies to guide RL agents in their actions. We evaluate (i) the rationale of LLM-generated strategies via expert review, and (ii) the Sharpe Ratio (SR) and Maximum Drawdown (MDD) of LLM-guided agents versus unguided baselines. Results show improved return and risk metrics over standard RL.
Date:	2025–08
URL:	https://d.repec.org/n?u=RePEc:arx:papers:2508.02366
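
The two performance measures named in the abstract, Sharpe Ratio and Maximum Drawdown, can be sketched as below. This is a minimal illustration under assumptions the abstract does not state: returns are simple per-period returns, the Sharpe ratio is annualized with a hypothetical `periods_per_year` parameter, and the sample returns are made up.

```python
import math
import statistics


def sharpe_ratio(returns, risk_free=0.0, periods_per_year=252):
    """Annualized Sharpe ratio from per-period simple returns."""
    excess = [r - risk_free for r in returns]
    return (
        statistics.mean(excess)
        / statistics.stdev(excess)
        * math.sqrt(periods_per_year)
    )


def max_drawdown(returns):
    """Largest peak-to-trough loss of the compounded equity curve, as a fraction."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, (peak - equity) / peak)
    return worst


# Hypothetical per-period returns of a trading agent.
agent_returns = [0.10, -0.50, 0.20]
print(sharpe_ratio(agent_returns))
print(max_drawdown(agent_returns))
```

The Sharpe ratio summarizes return per unit of volatility, while the drawdown captures the worst loss an investor would have lived through, so a guided agent can improve one without improving the other.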

Comparing Normalization Methods for Portfolio Optimization with Reinforcement Learning

By:	Caio de Souza Barbosa Costa; Anna Helena Reali Costa
Abstract:	Recently, reinforcement learning has achieved remarkable results in various domains, including robotics, games, natural language processing, and finance. In the financial domain, this approach has been applied to tasks such as portfolio optimization, where an agent continuously adjusts the allocation of assets within a financial portfolio to maximize profit. Numerous studies have introduced new simulation environments, neural network architectures, and training algorithms for this purpose. Among these, a domain-specific policy gradient algorithm has gained significant attention in the research community for being lightweight, fast, and for outperforming other approaches. However, recent studies have shown that this algorithm can yield inconsistent results and underperform, especially when the portfolio does not consist of cryptocurrencies. One possible explanation for this issue is that the commonly used state normalization method may cause the agent to lose critical information about the true value of the assets being traded. This paper explores this hypothesis by evaluating two of the most widely used normalization methods across three different markets (IBOVESPA, NYSE, and cryptocurrencies) and comparing them with the standard practice of normalizing data before training. The results indicate that, in this specific domain, the state normalization can indeed degrade the agent's performance.
Date:	2025–08
URL:	https://d.repec.org/n?u=RePEc:arx:papers:2508.03910

Kronos: A Foundation Model for the Language of Financial Markets

By:	Yu Shi; Zongliang Fu; Shuo Chen; Bohan Zhao; Wei Xu; Changshui Zhang; Jian Li
Abstract:	The success of the large-scale pre-training paradigm, exemplified by Large Language Models (LLMs), has inspired the development of Time Series Foundation Models (TSFMs). However, their application to financial candlestick (K-line) data remains limited, often underperforming non-pre-trained architectures. Moreover, existing TSFMs often overlook crucial downstream tasks such as volatility prediction and synthetic data generation. To address these limitations, we propose Kronos, a unified, scalable pre-training framework tailored to financial K-line modeling. Kronos introduces a specialized tokenizer that discretizes continuous market information into token sequences, preserving both price dynamics and trade activity patterns. We pre-train Kronos using an autoregressive objective on a massive, multi-market corpus of over 12 billion K-line records from 45 global exchanges, enabling it to learn nuanced temporal and cross-asset representations. Kronos excels in a zero-shot setting across a diverse set of financial tasks. On benchmark datasets, Kronos boosts price series forecasting RankIC by 93% over the leading TSFM and 87% over the best non-pre-trained baseline. It also achieves a 9% lower MAE in volatility forecasting and a 22% improvement in generative fidelity for synthetic K-line sequences. These results establish Kronos as a robust, versatile foundation model for end-to-end financial time series analysis. Our pre-trained model is publicly available at https://github.com/shiyu-coder/Kronos.
Date:	2025–08
URL:	https://d.repec.org/n?u=RePEc:arx:papers:2508.02739
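
The abstract reports gains in RankIC without defining it. A common reading, assumed here, is the Spearman rank correlation between predicted and realized returns across assets. The sketch below implements that reading in plain Python with hypothetical inputs. It is not taken from the Kronos code base, and it does not handle constant inputs, where the correlation is undefined.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def rank_ic(predicted, realized):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(predicted), average_ranks(realized))


# Hypothetical predicted and realized returns for four assets.
predicted_returns = [0.03, -0.01, 0.02, 0.00]
realized_returns = [0.02, -0.02, 0.05, 0.01]
print(rank_ic(predicted_returns, realized_returns))
```

Because only the ordering matters, this measure rewards a model for ranking assets correctly even when the predicted magnitudes are off.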

Efficient and Scalable Estimation of Distributional Treatment Effects with Multi-Task Neural Networks

By:	Tomu Hirata; Undral Byambadalai; Tatsushi Oka; Shota Yasui; Shingo Uto
Abstract:	We propose a novel multi-task neural network approach for estimating distributional treatment effects (DTE) in randomized experiments. While DTE provides more granular insights into the experiment outcomes over conventional methods focusing on the Average Treatment Effect (ATE), estimating it with regression adjustment methods presents significant challenges. Specifically, precision in the distribution tails suffers due to data imbalance, and computational inefficiencies arise from the need to solve numerous regression problems, particularly in large-scale datasets commonly encountered in industry. To address these limitations, our method leverages multi-task neural networks to estimate conditional outcome distributions while incorporating monotonic shape constraints and multi-threshold label learning to enhance accuracy. To demonstrate the practical effectiveness of our proposed method, we apply our method to both simulated and real-world datasets, including a randomized field experiment aimed at reducing water consumption in the US and a large-scale A/B test from a leading streaming platform in Japan. The experimental results consistently demonstrate superior performance across various datasets, establishing our method as a robust and practical solution for modern causal inference applications requiring a detailed understanding of treatment effect heterogeneity.
Date:	2025–07
URL:	https://d.repec.org/n?u=RePEc:arx:papers:2507.07738
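
The difference between a distributional treatment effect and the average treatment effect can be made concrete with the simplest possible estimator. The sketch below is not the multi-task neural network proposed in the paper and applies no regression adjustment. It assumes the DTE at a threshold is the difference between the treated and control outcome CDFs at that threshold, estimates it from empirical CDFs, and uses made-up outcomes.

```python
def empirical_cdf(sample, threshold):
    """Share of observations less than or equal to the threshold."""
    return sum(1 for v in sample if v <= threshold) / len(sample)


def distributional_treatment_effect(treated, control, thresholds):
    """Difference in empirical CDFs (treated minus control) at each threshold."""
    return [
        empirical_cdf(treated, y) - empirical_cdf(control, y) for y in thresholds
    ]


def average_treatment_effect(treated, control):
    """Difference in sample means."""
    return sum(treated) / len(treated) - sum(control) / len(control)


# Hypothetical outcomes (e.g. consumption units) in a randomized experiment.
treated = [1, 2, 2, 3]
control = [2, 3, 3, 4]
thresholds = [1, 2, 3]

print(average_treatment_effect(treated, control))
print(distributional_treatment_effect(treated, control, thresholds))
```

Each threshold is a separate estimation problem in this naive form, which hints at the computational cost the abstract attributes to solving many regression problems on large datasets.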

Blind Targeting: Personalization under Third-Party Privacy Constraints

By:	Anya Shchetkina
Abstract:	Major advertising platforms recently increased privacy protections by limiting advertisers' access to individual-level data. Instead of providing access to granular raw data, the platforms only allow a limited number of aggregate queries to a dataset, which is further protected by adding differentially private noise. This paper studies whether and how advertisers can design effective targeting policies within these restrictive privacy-preserving data environments. To achieve this, I develop a probabilistic machine learning method based on Bayesian optimization, which facilitates dynamic data exploration. Since Bayesian optimization was designed to sample points from a function to find its maximum, it is not applicable to aggregate queries and to targeting. Therefore, I introduce two innovations: (i) integral updating of posteriors, which allows selecting the best regions of the data to query rather than individual points, and (ii) a targeting-aware acquisition function that dynamically selects the most informative regions for the targeting task. I identify the conditions of the dataset and privacy environment that necessitate the use of such a "smart" querying strategy. I apply the strategic querying method to the Criteo AI Labs dataset for uplift modeling (Diemert et al., 2018) that contains visit and conversion data from 14M users. I show that an intuitive benchmark strategy only achieves 33% of the non-privacy-preserving targeting potential in some cases, while my strategic querying method achieves 97-101% of that potential, and is statistically indistinguishable from Causal Forest (Athey et al., 2019): a state-of-the-art non-privacy-preserving machine learning targeting method.
Date:	2025–07
URL:	https://d.repec.org/n?u=RePEc:arx:papers:2507.05175
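
The data environment described in the abstract, aggregate queries protected by differentially private noise, can be sketched with a noisy count query. The abstract does not say which noise mechanism the platforms use, so the Laplace mechanism here is an assumption, as are the function names, the privacy parameter `epsilon`, and the toy records. This is not the paper's querying strategy.

```python
import random


def laplace_noise(scale, rng):
    """Laplace(0, scale) sample, drawn as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng):
    """Count of matching records plus Laplace noise of scale 1/epsilon.

    A count changes by at most 1 when one record is added or removed,
    so its sensitivity is 1.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(0)
records = [{"age": a} for a in range(18, 68)]  # hypothetical user table


def is_under_30(r):
    """Example aggregate-query predicate: users younger than 30."""
    return r["age"] < 30


print(private_count(records, is_under_30, epsilon=0.5, rng=rng))
```

A smaller `epsilon` means more noise per answer, and with only a limited number of queries allowed, the advertiser cannot simply average the noise away. That constraint is what makes the choice of which regions to query matter.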

Federal Reserve Communication and the COVID-19 Pandemic

By:	Jonathan Benchimol; Sophia Kazinnik; Yossi Saadon
Abstract:	In this study, we examine the Federal Reserve's communication strategies during the COVID-19 pandemic, comparing them with communication during previous periods of economic stress. Using specialized dictionaries tailored to COVID-19, unconventional monetary policy (UMP), and financial stability, combined with sentiment analysis and topic modeling techniques, we identify a distinct focus in Fed communication during the pandemic on financial stability, market volatility, social welfare, and UMP, characterized by notable contextual uncertainty. Through comparative analysis, we juxtapose the Fed's communication during the COVID-19 crisis with its responses during the dot-com and global financial crises, examining content, sentiment, and timing dimensions. Our findings reveal that Fed communication and policy actions were more reactive to the COVID-19 crisis than to previous crises. Additionally, declining sentiment related to financial stability in interest rate announcements and minutes anticipated subsequent accommodative monetary policy decisions. We further document that communicating about UMP has become the "new normal" for the Fed's Federal Open Market Committee meeting minutes and Chairman's speeches since the Global Financial Crisis, reflecting an institutional adaptation in communication strategy following periods of economic distress. These findings contribute to our understanding of how central bank communication evolves during crises and how communication strategies adapt to exceptional economic circumstances.
Date:	2025–08
URL:	https://d.repec.org/n?u=RePEc:arx:papers:2508.04830

SHAP Stability in Credit Risk Management: A Case Study in Credit Card Default Model

By:	Luyun Lin; Yiqing Wang
Abstract:	The growth of the consumer credit card market brings substantial regulatory and risk management challenges. The application of advanced machine learning models raises concerns about model transparency and fairness for both financial institutions and regulatory departments. In this study, we evaluate the consistency of one commonly used Explainable AI (XAI) technique, SHAP, for variable explanation in credit card probability-of-default models via a case study on credit card default prediction. The study shows that consistency is related to the variable importance level and hence provides practical recommendations for credit risk management.
Date:	2025–08
URL:	https://d.repec.org/n?u=RePEc:arx:papers:2508.01851

Forecasting Climate Policy Uncertainty: Evidence from the United States

By:	Donia Besher; Anirban Sengupta; Tanujit Chakraborty
Abstract:	Forecasting Climate Policy Uncertainty (CPU) is essential as policymakers strive to balance economic growth with environmental goals. High levels of CPU can slow down investments in green technologies, make regulatory planning more difficult, and increase public resistance to climate reforms, especially during times of economic stress. This study addresses the challenge of forecasting the US CPU index by building the Bayesian Structural Time Series (BSTS) model with a large set of covariates, including economic indicators, financial cycle data, and public sentiments captured through Google Trends. The key strength of the BSTS model lies in its ability to efficiently manage a large number of covariates through its dynamic feature selection mechanism based on the spike-and-slab prior. To validate the effectiveness of the selected features of the BSTS model, an impulse response analysis is performed. The results show that macro-financial shocks impact CPU in different ways over time. Numerical experiments are performed to evaluate the performance of the BSTS model with exogenous variables on the US CPU dataset over different forecasting horizons. The empirical results confirm that BSTS consistently outperforms classical and deep learning frameworks, particularly for semi-long-term and long-term forecasts.
Date:	2025–07
URL:	https://d.repec.org/n?u=RePEc:arx:papers:2507.12276

This nep-big issue is ©2025 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.

General information on the NEP project can be found at https://nep.repec.org. For comments please write to the director of NEP, Marco Novarese at <director@nep.repec.org>. Put “NEP” in the subject, otherwise your mail may be rejected.

NEP’s infrastructure is sponsored by the Griffith Business School of Griffith University in Australia.