NEP: New Economics Papers on Big Data |
By: | Patrick J. Laub; Tu Pho; Bernard Wong |
Abstract: | This paper introduces the Actuarial Neural Additive Model, an inherently interpretable deep learning model for general insurance pricing that offers fully transparent and interpretable results while retaining the strong predictive power of neural networks. This model assigns a dedicated neural network (or subnetwork) to each individual covariate and pairwise interaction term to independently learn its impact on the modeled output, while implementing various architectural constraints to allow for essential interpretability (e.g. sparsity) and practical requirements (e.g. smoothness, monotonicity) in insurance applications. The development of our model is grounded in a solid foundation: we establish a concrete definition of interpretability within the insurance context, complemented by a rigorous mathematical framework. Comparisons in terms of prediction accuracy are made with traditional actuarial and state-of-the-art machine learning methods using both synthetic and real insurance datasets. The results show that the proposed model outperforms other methods in most cases while offering complete transparency in its internal logic, underscoring its strong interpretability and predictive capability. |
Date: | 2025–09 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2509.08467 |
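The per-covariate subnetwork design described above can be sketched as a generic neural additive model in PyTorch; this is a minimal illustration under assumed layer sizes, omitting the paper's pairwise-interaction subnetworks and actuarial constraints (sparsity, smoothness, monotonicity).

```python
import torch
import torch.nn as nn

class FeatureNet(nn.Module):
    """One small subnetwork per covariate, learning its marginal effect."""
    def __init__(self, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):           # x: (batch, 1)
        return self.net(x)

class NeuralAdditiveModel(nn.Module):
    def __init__(self, n_features: int):
        super().__init__()
        self.feature_nets = nn.ModuleList(FeatureNet() for _ in range(n_features))
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, x):           # x: (batch, n_features)
        contributions = [f(x[:, i:i + 1]) for i, f in enumerate(self.feature_nets)]
        # Summing per-feature outputs keeps each effect separately inspectable.
        return self.bias + torch.cat(contributions, dim=1).sum(dim=1, keepdim=True)

model = NeuralAdditiveModel(n_features=5)
pred = model(torch.randn(8, 5))     # (8, 1) additive prediction
```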
By: | Schmidt, Tobias; Lange, Kai-Robin; Reccius, Matthias; Müller, Henrik; Roos, Michael W. M.; Jentsch, Carsten |
Abstract: | As interest in economic narratives has grown in recent years, so has the number of pipelines dedicated to extracting such narratives from texts. Pipelines often employ a mix of state-of-the-art natural language processing techniques, such as BERT, to tackle this task. While effective on the foundational linguistic operations essential for narrative extraction, such models lack the deeper semantic understanding required to distinguish extracting economic narratives from merely conducting classic tasks like Semantic Role Labeling. Instead of relying on complex model pipelines, we evaluate the benefits of Large Language Models (LLMs) by analyzing a corpus of Wall Street Journal and New York Times newspaper articles about inflation. We apply a rigorous narrative definition and compare GPT-4o outputs to gold-standard narratives produced by expert annotators. Our results suggest that GPT-4o is capable of extracting valid economic narratives in a structured format, but still falls short of expert-level performance when handling complex documents and narratives. Given the novelty of LLMs in economic research, we also provide guidance for future work in economics and the social sciences that employs LLMs to pursue similar objectives. |
Keywords: | Economic narratives, natural language processing, large language models |
JEL: | C18 C55 C87 E70 |
Date: | 2025 |
URL: | https://d.repec.org/n?u=RePEc:zbw:rwirep:325494 |
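A structured-extraction call of the kind evaluated above might look like the following sketch; the prompt wording and JSON fields are hypothetical, not the authors' annotation protocol.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "Extract any economic narrative about inflation from the article below. "
    "Return JSON with keys: cause, effect, actors. "
    "If no narrative is present, return an empty JSON object.\n\n{article}"
)

def extract_narrative(article: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": PROMPT.format(article=article)}],
        response_format={"type": "json_object"},  # ask for machine-readable output
    )
    return response.choices[0].message.content
```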
By: | Adrian Iulian Cristescu; Matteo Giordano |
Abstract: | Predicting the probability of default (PD) of prospective loans is a critical objective for financial institutions. In recent years, machine learning (ML) algorithms have achieved remarkable success across a wide variety of prediction tasks; yet, they remain relatively underutilised in credit risk analysis. This paper highlights the opportunities that ML algorithms offer to this field by comparing the performance of five predictive models (Random Forests, Decision Trees, XGBoost, Gradient Boosting, and AdaBoost) to the predominantly used logistic regression, over a benchmark dataset from Scheule et al. (Credit Risk Analytics: The R Companion). Our findings underscore the strengths and weaknesses of each method, providing valuable insights into the most effective ML algorithms for PD prediction in the context of loan portfolios. |
Date: | 2025–06 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2506.19789 |
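The model comparison can be reproduced in outline with scikit-learn and xgboost; the sketch below runs on synthetic data, since the Scheule et al. benchmark dataset is not bundled here, and hyperparameters are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

# Imbalanced classes, loosely mimicking default data.
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.9],
                           random_state=0)

models = {
    "logit": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(max_depth=5),
    "forest": RandomForestClassifier(n_estimators=300),
    "gbm": GradientBoostingClassifier(),
    "adaboost": AdaBoostClassifier(),
    "xgboost": XGBClassifier(n_estimators=300, eval_metric="logloss"),
}
for name, model in models.items():
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name:>8}: AUC = {auc:.3f}")
```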
By: | Ayaan Qayyum |
Abstract: | This paper discusses how headline data can be used to predict stock prices. The stock price in question is the SPDR S&P 500 ETF Trust, also known as SPY, which tracks the performance of the 500 largest publicly traded corporations in the United States. A key focus is to use news headlines from the Wall Street Journal (WSJ) to predict the movement of stock prices on a daily timescale, with OpenAI-based text embedding models used to create vector encodings of each headline and principal component analysis (PCA) to extract the key features. The challenge of this work is to capture the time-dependent and time-independent, nuanced impacts of news on stock prices while handling potential lag effects and market noise. Financial and economic data were collected to improve model performance; such sources include the U.S. Dollar Index (DXY) and Treasury Interest Yields. Over 390 machine-learning inference models were trained. The preliminary results show that headline data embeddings improve stock price prediction by at least 40% compared to training and optimizing a machine learning system without headline data embeddings. |
Date: | 2025–06 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2507.01970 |
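The headline-to-feature pipeline might be sketched as follows; the embedding model name is an assumption (the abstract says only "OpenAI-based"), and the two-component PCA is purely illustrative.

```python
import numpy as np
from openai import OpenAI
from sklearn.decomposition import PCA

client = OpenAI()  # assumes OPENAI_API_KEY is set

def embed_headlines(headlines: list[str]) -> np.ndarray:
    response = client.embeddings.create(model="text-embedding-3-small",
                                        input=headlines)
    return np.array([item.embedding for item in response.data])

headlines = ["Fed signals rate pause", "Tech stocks rally on earnings"]
X = embed_headlines(headlines)                     # (n_headlines, embedding_dim)
X_reduced = PCA(n_components=2).fit_transform(X)   # compact features for the predictor
```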
By: | Peilin Rao; Randall R. Rojas |
Abstract: | This paper provides robust, new evidence on the causal drivers of market troughs. We demonstrate that conclusions about these triggers are critically sensitive to model specification, and we move beyond restrictive linear models with a flexible double machine learning (DML) average partial effect framework. Our robust estimates identify the volatility of options-implied risk appetite and market liquidity as key causal drivers, relationships misrepresented or obscured by simpler models. These findings provide high-frequency empirical support for intermediary asset pricing theories. This causal analysis is enabled by a high-performance nowcasting model that accurately identifies capitulation events in real time. |
Date: | 2025–09 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2509.05922 |
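A minimal cross-fitted DML estimator for a partially linear model conveys the core idea; this generic sketch is not the paper's exact average partial effect estimator.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

def dml_partial_effect(X, D, Y, n_folds=5, seed=0):
    """Cross-fitted DML for Y = theta*D + g(X) + e."""
    res_d = np.zeros_like(D, dtype=float)
    res_y = np.zeros_like(Y, dtype=float)
    for train, test in KFold(n_folds, shuffle=True, random_state=seed).split(X):
        # Residualize treatment and outcome on controls, out of fold.
        res_d[test] = D[test] - RandomForestRegressor().fit(X[train], D[train]).predict(X[test])
        res_y[test] = Y[test] - RandomForestRegressor().fit(X[train], Y[train]).predict(X[test])
    return res_d @ res_y / (res_d @ res_d)   # theta-hat from residual-on-residual OLS

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))
D = X[:, 0] ** 2 + rng.normal(size=2000)       # nonlinear confounding
Y = 0.5 * D + np.sin(X[:, 0]) + rng.normal(size=2000)
print(dml_partial_effect(X, D, Y))             # approximately 0.5
```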
By: | Guo, Hongfei; Marín Díazaraque, Juan Miguel; Veiga, Helena |
Abstract: | Accurately forecasting volatility is central to risk management, portfolio allocation, and asset pricing. While high-frequency realised measures have been shown to improve predictive accuracy, their value is not uniform across markets or horizons. This paper introduces a class of Bayesian neural network stochastic volatility (NN-SV) models that combine the flexibility of machine learning with the structure of stochastic volatility models. The specifications incorporate realised variance, jump variation, and semivariance from daily and intraday data, and model uncertainty is addressed through a Bayesian stacking ensemble that adaptively aggregates predictive distributions. Using data from the DAX, FTSE 100, and S&P 500 indices, the models are evaluated against classical GARCH and parametric SV benchmarks. The results show that the predictive content of high-frequency measures is horizon- and market-specific. The Bayesian ensemble further enhances robustness by exploiting complementary model strengths. Overall, NN-SV models not only outperform established benchmarks in many settings but also provide new insights into market-specific drivers of volatility dynamics. |
Keywords: | Ensemble forecasts; GARCH; Neural networks; Realised volatility; Stochastic volatility |
JEL: | C11 C32 C45 C53 C58 |
Date: | 2025–09–16 |
URL: | https://d.repec.org/n?u=RePEc:cte:wsrepe:47944 |
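The stacking step, i.e., aggregating predictive distributions with data-driven weights, can be illustrated with a toy log-score optimizer; the softmax-parametrized simplex weights here stand in for the paper's richer Bayesian ensemble.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def stack_weights(mu, sigma, y):
    """mu, sigma: (n_obs, n_models) predictive means/sds; y: (n_obs,) outcomes."""
    dens = norm.pdf(y[:, None], loc=mu, scale=sigma)   # per-model predictive densities

    def neg_log_score(z):                              # softmax keeps weights on the simplex
        w = np.exp(z) / np.exp(z).sum()
        return -np.log(dens @ w + 1e-300).sum()

    z_opt = minimize(neg_log_score, np.zeros(mu.shape[1])).x
    return np.exp(z_opt) / np.exp(z_opt).sum()

rng = np.random.default_rng(1)
y = rng.normal(size=200)
mu = np.column_stack([y + rng.normal(0, 0.5, 200),    # informative model
                      rng.normal(0, 1, 200)])          # uninformative model
sigma = np.full_like(mu, 0.8)
print(stack_weights(mu, sigma, y))  # weight tilts toward the informative model
```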
By: | Lijie Ding; Egang Lu; Kin Cheung |
Abstract: | We present a deep learning framework for pricing options based on market-implied volatility surfaces. Using end-of-day S&P 500 index options quotes from 2018-2023, we construct arbitrage-free volatility surfaces and generate training data for American puts and arithmetic Asian options using QuantLib. To address the high dimensionality of volatility surfaces, we employ a variational autoencoder (VAE) that compresses volatility surfaces across maturities and strikes into a 10-dimensional latent representation. We feed these latent variables, combined with option-specific inputs such as strike and maturity, into a multilayer perceptron to predict option prices. Our model is trained in stages: first to train the VAE for volatility surface compression and reconstruction, then to learn the option-pricing mapping, and finally to fine-tune the entire network end-to-end. The trained pricer achieves high accuracy across American and Asian options, with prediction errors concentrated primarily near long maturities and at-the-money strikes, where absolute bid-ask price differences are known to be large. Our method offers an efficient and scalable approach that requires only a single neural network forward pass and naturally improves with additional data. By bridging volatility surface modeling and option pricing in a unified framework, it provides a fast and flexible alternative to traditional numerical approaches for exotic options. |
Date: | 2025–09 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2509.05911 |
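A compact VAE that compresses a flattened volatility surface into a 10-dimensional latent code, as the abstract describes, might look like this; layer sizes are assumed for illustration.

```python
import torch
import torch.nn as nn

class VolSurfaceVAE(nn.Module):
    def __init__(self, surface_dim: int, latent_dim: int = 10):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(surface_dim, 128), nn.ReLU())
        self.to_mu = nn.Linear(128, latent_dim)
        self.to_logvar = nn.Linear(128, latent_dim)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, surface_dim))

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization
        return self.decoder(z), mu, logvar

vae = VolSurfaceVAE(surface_dim=20 * 15)   # e.g. 20 maturities x 15 strikes
recon, mu, logvar = vae(torch.randn(4, 300))
kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())  # KL term of the ELBO
```

After pretraining, the latent code mu, concatenated with strike and maturity, would feed the pricing MLP.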
By: | Daniel Graeber; Lorenz Meister; Carsten Schröder; Sabine Zinn |
Abstract: | Machine learning is increasingly used in social science research, especially for prediction. However, the results are sometimes not as straightforward to interpret as those of classic regression models. In this paper, we address this trade-off by comparing the predictive performance of random forests and logit regressions in analyzing labor market vulnerabilities during the COVID-19 pandemic, and by using a global surrogate model to enhance our understanding of the complex dynamics. Our study shows that, especially in the presence of non-linearities and feature interactions, random forests outperform regressions both in predictive accuracy and interpretability, yielding policy-relevant insights on vulnerable groups affected by labor market disruptions. |
Keywords: | Machine learning, interpretability, labor market, random forests |
JEL: | C45 C53 C25 J08 I18 C83 J21 |
Date: | 2025 |
URL: | https://d.repec.org/n?u=RePEc:diw:diwsop:diw_sp1230 |
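The global surrogate idea, fitting an interpretable model to the black-box model's predictions, can be sketched with a shallow tree distilling a random forest; depth and data are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=2000, n_features=8, random_state=0)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Train the surrogate on the forest's predictions, not the true labels.
surrogate = DecisionTreeClassifier(max_depth=3).fit(X, forest.predict(X))
fidelity = surrogate.score(X, forest.predict(X))   # how well the tree mimics the forest
print(f"surrogate fidelity: {fidelity:.2f}")
print(export_text(surrogate))                      # human-readable decision rules
```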
By: | Kasymkhan Khubiev; Mikhail Semenov; Irina Podlipnova |
Abstract: | Deep learning is evolving fast and is being integrated into various domains. Finance is a challenging field for deep learning, especially in the case of interpretable artificial intelligence (AI). Although classical approaches perform very well with natural language processing, computer vision, and forecasting, they are not perfect for the financial world, in which specialists use different metrics to evaluate model performance. We first introduce financially grounded loss functions derived from key quantitative finance metrics, including the Sharpe ratio, Profit-and-Loss (PnL), and Maximum Drawdown. Additionally, we propose turnover regularization, a method that inherently constrains the turnover of generated positions within predefined limits. Our findings demonstrate that the proposed loss functions, in conjunction with turnover regularization, outperform the traditional mean squared error loss for return prediction tasks when evaluated using algorithmic trading metrics. The study shows that financially grounded metrics enhance predictive performance in trading strategies and portfolio optimization. |
Date: | 2025–09 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2509.04541 |
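A financially grounded loss of the kind proposed above might be sketched as a differentiable Sharpe objective with a turnover penalty; the penalty form and weight are assumptions, not the paper's exact regularizer.

```python
import torch

def sharpe_loss(positions, returns, turnover_weight=0.1, eps=1e-8):
    """positions, returns: (T,) tensors; maximize Sharpe, penalize turnover."""
    pnl = positions * returns                     # per-period strategy returns
    sharpe = pnl.mean() / (pnl.std() + eps)
    turnover = positions.diff().abs().mean()      # average position change
    return -sharpe + turnover_weight * turnover   # minimizing maximizes Sharpe

positions = torch.tanh(torch.randn(250, requires_grad=True))  # stand-in model output
returns = torch.randn(250) * 0.01
loss = sharpe_loss(positions, returns)
loss.backward()                                   # gradients flow back to the model
```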
By: | Yimin Du |
Abstract: | This paper presents a comprehensive machine learning framework for quantitative trading that achieves superior risk-adjusted returns through systematic factor engineering, real-time computation optimization, and cross-sectional portfolio construction. Our approach integrates multi-factor alpha discovery with bias correction techniques, leveraging PyTorch-accelerated factor computation and advanced portfolio optimization. The system processes 500-1000 factors derived from open-source alpha101 extensions and proprietary market microstructure signals. Key innovations include tensor-based factor computation acceleration, geometric Brownian motion data augmentation, and cross-sectional neutralization strategies. Empirical validation on Chinese A-share markets (2010-2024) demonstrates annualized returns of 20% with Sharpe ratios exceeding 2.0, significantly outperforming traditional approaches. Our analysis reveals the critical importance of bias correction in factor construction and the substantial impact of cross-sectional portfolio optimization on strategy performance. Code and experimental implementations are available at: https://github.com/initial-d/ml-quant-trading |
Date: | 2025–06 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2507.07107 |
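The geometric Brownian motion augmentation step can be sketched by simulating synthetic paths calibrated to a series' estimated drift and volatility; parameter choices below are illustrative.

```python
import numpy as np

def gbm_paths(s0, mu, sigma, n_steps, n_paths, dt=1 / 252, seed=0):
    """Simulate GBM price paths: dS/S = mu dt + sigma dW."""
    rng = np.random.default_rng(seed)
    shocks = rng.normal(size=(n_paths, n_steps))
    log_increments = (mu - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * shocks
    return s0 * np.exp(np.cumsum(log_increments, axis=1))

# Calibrate to an observed price series (here, simulated stand-in data).
prices = np.cumprod(1 + np.random.default_rng(1).normal(0.0004, 0.01, 1000)) * 100
log_ret = np.diff(np.log(prices))
mu_hat = log_ret.mean() * 252               # annualized drift estimate
sigma_hat = log_ret.std() * np.sqrt(252)    # annualized volatility estimate
augmented = gbm_paths(prices[-1], mu_hat, sigma_hat, n_steps=250, n_paths=50)
```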
By: | Yiran Wan; Xinyu Ying; Shengzhen Xu |
Abstract: | The Straddle Option is a financial trading tool that explores volatility premiums in high-volatility markets without predicting price direction. Although deep reinforcement learning has emerged as a powerful approach to trading automation in financial markets, existing work has mostly focused on predicting price trends and making trading decisions by combining multi-dimensional datasets like blogs and videos, which leads to high computational costs and unstable performance in high-volatility markets. To tackle this challenge, we develop automated straddle option trading based on reinforcement learning and attention mechanisms to handle unpredictability in high-volatility markets. Firstly, we leverage the attention mechanisms in Transformer-DDQN through both self-attention over time series data and channel attention over multi-cycle information. Secondly, a novel reward function considering excess earnings is designed to focus on long-term profits and neglect short-term losses over a stop line. Thirdly, we identify resistance levels to provide reference information when price movements become highly uncertain amid an intensified battle between buyers and sellers. Through extensive experiments on the Chinese stock, Brent crude oil, and Bitcoin markets, our attention-based Transformer-DDQN model exhibits the lowest maximum drawdown across all markets and outperforms other models by 92.5% in terms of average return across the markets other than crude oil, which exhibits relatively low fluctuation. |
Date: | 2025–08 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2509.07987 |
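The channel-attention component might resemble a squeeze-and-excitation-style gate over cycle features; this generic sketch is an assumption, as the abstract does not specify the exact design.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, n_channels: int, reduction: int = 4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(n_channels, n_channels // reduction), nn.ReLU(),
            nn.Linear(n_channels // reduction, n_channels), nn.Sigmoid(),
        )

    def forward(self, x):                            # x: (batch, channels, time)
        weights = self.gate(x.mean(dim=-1))          # squeeze over the time axis
        return x * weights.unsqueeze(-1)             # re-weight each channel

att = ChannelAttention(n_channels=8)
out = att(torch.randn(2, 8, 60))     # e.g. 8 multi-cycle features over 60 bars
```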
By: | Guillaume Coqueret; Martial Laguerre |
Abstract: | This paper investigates the impact of posterior drift on out-of-sample forecasting accuracy in overparametrized machine learning models. We document the loss in performance when the loadings of the data generating process change between the training and testing samples. This matters crucially in settings in which regime changes are likely to occur, for instance, in financial markets. Applied to equity premium forecasting, our results underline the sensitivity of a market timing strategy to sub-periods and to the bandwidth parameters that control the complexity of the model. For the average investor, we find that focusing on holding periods of 15 years can generate very heterogeneous returns, especially for small bandwidths. Large bandwidths yield much more consistent outcomes, but are far less appealing from a risk-adjusted return standpoint. All in all, our findings recommend caution when resorting to large linear models for stock market predictions. |
Date: | 2025–06 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2506.23619 |
By: | Hongyi Liu |
Abstract: | We propose a new pseudo-Siamese Network for Asset Pricing (SNAP) model, based on deep learning approaches, for conditional asset pricing. Our model allows for deep alpha, deep beta, and deep factor risk premia conditional on high-dimensional observable information on financial characteristics and macroeconomic states, while storing the long-term dependency of the informative features through a long short-term memory network. We apply this method to monthly U.S. stock returns from 1970 to 2019 and find that our pseudo-SNAP model outperforms the benchmark approaches in terms of out-of-sample prediction and out-of-sample Sharpe ratio. In addition, we apply our method to calculate deep mispricing errors, which we use to construct an arbitrage portfolio via K-Means clustering. We find that the arbitrage portfolio has significant alphas. |
Date: | 2025–09 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2509.04812 |
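Constructing an arbitrage portfolio by clustering mispricing errors can be sketched as follows; the cluster count and long/short rule are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
mispricing = rng.normal(size=(500, 1))            # one mispricing error per stock

labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(mispricing)
cluster_means = np.array([mispricing[labels == k].mean() for k in range(10)])

# Long the most underpriced cluster, short the most overpriced one.
long_leg = np.flatnonzero(labels == cluster_means.argmin())
short_leg = np.flatnonzero(labels == cluster_means.argmax())
```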
By: | Yang Chen; Yueheng Jiang; Zhaozhao Ma; Yuchen Cao; Jacky Keung; Kun Kuang; Leilei Gan; Yiquan Wu; Fei Wu |
Abstract: | The inherent non-stationarity of financial markets and the complexity of multi-modal information pose significant challenges to existing quantitative trading models. Traditional methods relying on fixed structures and unimodal data struggle to adapt to market regime shifts, while large language model (LLM)-driven solutions - despite their multi-modal comprehension - suffer from static strategies and homogeneous expert designs, lacking dynamic adjustment and fine-grained decision mechanisms. To address these limitations, we propose MM-DREX: a Multimodal-driven, Dynamically-Routed EXpert framework based on large language models. MM-DREX explicitly decouples market state perception from strategy execution to enable adaptive sequential decision-making in non-stationary environments. Specifically, it (1) introduces a vision-language model (VLM)-powered dynamic router that jointly analyzes candlestick chart patterns and long-term temporal features to allocate real-time expert weights; (2) designs four heterogeneous trading experts (trend, reversal, breakout, positioning) generating specialized fine-grained sub-strategies; and (3) proposes an SFT-RL hybrid training paradigm to synergistically optimize the router's market classification capability and experts' risk-adjusted decision-making. Extensive experiments on multi-modal datasets spanning stocks, futures, and cryptocurrencies demonstrate that MM-DREX significantly outperforms 15 baselines (including state-of-the-art financial LLMs and deep reinforcement learning models) across key metrics: total return, Sharpe ratio, and maximum drawdown, validating its robustness and generalization. Additionally, an interpretability module traces routing logic and expert behavior in real time, providing an audit trail for strategy transparency. |
Date: | 2025–09 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2509.05080 |
By: | Arif Pathan |
Abstract: | Short-term sentiment forecasting in financial markets (e.g., stocks, indices) is challenging due to volatility, non-linearity, and noise in OHLC (Open, High, Low, Close) data. This paper introduces a novel CMG (Chaos-Markov-Gaussian) framework that integrates chaos theory, Markov property, and Gaussian processes to improve prediction accuracy. Chaos theory captures nonlinear dynamics; the Markov chain models regime shifts; Gaussian processes add probabilistic reasoning. We enhance the framework with transformer-based deep learning models to capture temporal patterns efficiently. The CMG Framework is designed for fast, resource-efficient, and accurate forecasting of any financial instrument's OHLC time series. Unlike traditional models that require heavy infrastructure and instrument-specific tuning, CMG reduces overhead and generalizes well. We evaluate the framework on market indices, forecasting sentiment for the next trading day's first quarter. A comparative study against statistical, ML, and DL baselines trained on the same dataset with no feature engineering shows CMG consistently outperforms in accuracy and efficiency, making it valuable for analysts and financial institutions. |
Date: | 2025–06 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2506.17244 |
By: | Tanujit Chakraborty; Donia Besher; Madhurima Panja; Shovon Sengupta |
Abstract: | Accurate forecasting of exchange rates remains a persistent challenge, particularly for emerging economies such as Brazil, Russia, India, and China (BRIC). These series exhibit long memory, nonlinearity, and non-stationarity properties that conventional time series models struggle to capture. Additionally, there exist several key drivers of exchange rate dynamics, including global economic policy uncertainty, US equity market volatility, US monetary policy uncertainty, oil price growth rates, and country-specific short-term interest rate differentials. These empirical complexities underscore the need for a flexible modeling framework that can jointly accommodate long memory, nonlinearity, and the influence of external drivers. To address these challenges, we propose a Neural AutoRegressive Fractionally Integrated Moving Average (NARFIMA) model that combines the long-memory representation of ARFIMA with the nonlinear learning capacity of neural networks, while flexibly incorporating exogenous causal variables. We establish theoretical properties of the model, including asymptotic stationarity of the NARFIMA process using Markov chains and nonlinear time series techniques. We quantify forecast uncertainty using conformal prediction intervals within the NARFIMA framework. Empirical results across six forecast horizons show that NARFIMA consistently outperforms various state-of-the-art statistical and machine learning models in forecasting BRIC exchange rates. These findings provide new insights for policymakers and market participants navigating volatile financial conditions. The narfima R package provides an implementation of our approach. |
Date: | 2025–09 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2509.06697 |
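The long-memory ingredient that NARFIMA inherits from ARFIMA rests on fractional differencing, with weights w_k = -w_{k-1}(d - k + 1)/k; the sketch below shows the filter alone (the authors' narfima R package implements the full model; the neural part is omitted here).

```python
import numpy as np

def frac_diff_weights(d: float, n: int) -> np.ndarray:
    """Weights of the fractional differencing operator (1 - L)^d."""
    w = [1.0]
    for k in range(1, n):
        w.append(-w[-1] * (d - k + 1) / k)
    return np.array(w)

def frac_diff(series: np.ndarray, d: float) -> np.ndarray:
    w = frac_diff_weights(d, len(series))
    # Each output is a weighted sum of the current and all past values.
    return np.array([w[:t + 1][::-1] @ series[:t + 1] for t in range(len(series))])

x = np.cumsum(np.random.default_rng(0).normal(size=500))  # a persistent series
x_d = frac_diff(x, d=0.4)   # fractionally differenced, long memory retained
```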
By: | Feliks Bańka (Warsaw University of Technology, Faculty of Electronics and Information Technology); Jarosław A. Chudziak (Warsaw University of Technology) |
Abstract: | In volatile financial markets, balancing risk and return remains a significant challenge. Traditional approaches often focus solely on equity allocation, overlooking the strategic advantages of options trading for dynamic risk hedging. This work presents DeltaHedge, a multi-agent framework that integrates options trading with AI-driven portfolio management. By combining advanced reinforcement learning techniques with an ensembled options-based hedging strategy, DeltaHedge enhances risk-adjusted returns and stabilizes portfolio performance across varying market conditions. Experimental results demonstrate that DeltaHedge outperforms traditional strategies and standalone models, underscoring its potential to transform practical portfolio management in complex financial environments. Building on these findings, this paper contributes to the fields of quantitative finance and AI-driven portfolio optimization by introducing a novel multi-agent system for integrating options trading strategies, addressing a gap in the existing literature. |
Date: | 2025–09 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2509.12753 |
By: | Thomas R. Cook; Sophia Kazinnik |
Abstract: | Financial institutions increasingly rely on large language models (LLMs) for high-stakes decision-making. However, these models risk perpetuating harmful biases if deployed without careful oversight. This paper investigates racial bias in LLMs specifically through the lens of credit decision-making tasks, operating on the premise that biases identified here are indicative of broader concerns across financial applications. We introduce a reproducible, counterfactual testing framework that evaluates how models respond to simulated mortgage applicants identical in all attributes except race. Our results reveal significant race-based discrepancies, exceeding historically observed bias levels. Leveraging layer-wise analysis, we track the propagation of sensitive attributes through internal model representations. Building on this, we deploy a control-vector intervention that effectively reduces racial disparities by up to 70% (33% on average) without impairing overall model performance. Our approach provides a transparent and practical toolkit for the identification and mitigation of bias in financial LLM deployments. |
Date: | 2025–06 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2506.17490 |
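The counterfactual testing framework can be outlined as paired applications identical except for a race signal; score_application below is a hypothetical stand-in for an LLM call, and the name-proxy pairs are illustrative.

```python
TEMPLATE = ("Applicant: {name}. Income: $85,000. Credit score: 720. "
            "Loan: $300,000 mortgage. Decision (approve/deny)?")

def score_application(prompt: str) -> float:
    """Placeholder: in practice, return an LLM's approval probability."""
    return 0.5

pairs = [("Emily Walsh", "Lakisha Washington")]   # names as race proxies
for name_a, name_b in pairs:
    gap = (score_application(TEMPLATE.format(name=name_a))
           - score_application(TEMPLATE.format(name=name_b)))
    print(f"approval gap for pair ({name_a}, {name_b}): {gap:+.3f}")
```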
By: | Ruisi Li; Xinhui Gu |
Abstract: | We propose a deep-learning-driven multi-factor investment model optimization method for risk control. By constructing a model based on Long Short-Term Memory (LSTM) networks and combining it with a multi-factor investment model, we optimize factor selection and weight determination to enhance the model's adaptability and robustness to market changes. Empirical analysis shows that the LSTM model is significantly superior to the benchmark model on risk control indicators such as maximum drawdown, Sharpe ratio, and value at risk (VaR), and shows strong adaptability and robustness in different market environments. Furthermore, applying the model to an actual portfolio to optimize asset allocation significantly improves portfolio performance, provides investors with a more scientific and accurate basis for investment decisions, and effectively balances returns and risks. |
Date: | 2025–06 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2507.00332 |
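The risk-control indicators named above have standard definitions, sketched here as a minimal reference implementation on a return series.

```python
import numpy as np

def max_drawdown(returns: np.ndarray) -> float:
    wealth = np.cumprod(1 + returns)
    peaks = np.maximum.accumulate(wealth)
    return float((wealth / peaks - 1).min())       # most negative dip from a peak

def sharpe_ratio(returns: np.ndarray, periods: int = 252) -> float:
    return float(returns.mean() / returns.std() * np.sqrt(periods))

def historical_var(returns: np.ndarray, alpha: float = 0.05) -> float:
    return float(-np.quantile(returns, alpha))     # loss threshold at 95% confidence

r = np.random.default_rng(0).normal(0.0005, 0.01, 1000)
print(max_drawdown(r), sharpe_ratio(r), historical_var(r))
```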
By: | Aakash Kalyani; Serdar Ozkan |
Abstract: | We develop a novel measure of firm-level marginal labor cost and investigate its pass-through to inflation. To construct this measure, we apply textual analysis to earnings calls to identify discussions of labor-related topics such as higher costs, shortages, and hiring. Leveraging the theoretical principle that cost-minimizing firms equate marginal costs across variable inputs, we project changes in firms' intermediate input revenue shares onto the intensity of labor-related discussions to quantify their contributions to marginal labor costs. This approach provides an economically motivated way to reduce multidimensional qualitative textual information into a single quantitative measure. An aggregate index from this measure tracks closely with conventional aggregate slack variables and outperforms them in forecasting inflation. When aggregated at the industry level, we find a significant but heterogeneous pass-through of marginal labor costs to PPI inflation, with the pass-through highest for the service sector and near zero for manufacturing. Consistent with the latter fact, firm-level data reveal that investment in automation mitigates the effects of higher labor cost pressures in manufacturing. |
Keywords: | wage inflation; automation; textual data; machine learning |
JEL: | E24 J24 J31 J64 |
Date: | 2025–07–06 |
URL: | https://d.repec.org/n?u=RePEc:fip:fedlwp:101743 |
By: | Lin William Cong; Stephen Q. Yang |
Abstract: | We develop an empirical approach for analyzing multi-dimensional discrimination using multimodal data, combining human perception measures with language-embedding-based, nonlinear controls for latent quality to relax restrictive assumptions in causal machine learning. Applying it to the U.S. patent examination process, we find that, ceteris paribus, applications from female inventors are 1.8 percentage points less likely to be approved, and those from Black inventors are 3 percentage points less likely—inconsistent with legally prescribed criteria. Jointly studying multiple bias dimensions and their intersections for the first time, we uncover new biases, including an affiliation bias—individual inventors are disadvantaged by 6.6 percentage points relative to employees of large, public firms, a disparity larger than any demographic gap. Moreover, innovation quality, location, and other factors can mitigate or compound discrimination, and the disparities interact: for example, racial gaps vanish among public-firm employees, masking more severe discrimination against individuals. Existing theories such as homophily cannot fully explain the results, but a simple model of correlation neglect does. |
JEL: | G30 J15 J16 O31 |
Date: | 2025–09 |
URL: | https://d.repec.org/n?u=RePEc:nbr:nberwo:34197 |
By: | Spears, Taylor C. (University of Edinburgh); Hansen, Kristian Bondo; Xu, Ruowen; Millo, Yuval |
Abstract: | Synthetic datasets, artificially generated to mimic real-world data while maintaining anonymization, have emerged as a promising technology in the financial sector, attracting support from regulators and market participants as a solution to data privacy and scarcity challenges limiting machine learning deployment. This paper argues that synthetic data's effects on financial markets depend critically on how these technologies are embedded within existing machine learning infrastructural "stacks" rather than on their intrinsic properties. We identify three key tensions that will determine whether adoption proves beneficial or harmful: (1) data circulability versus opacity, particularly the "double opacity" problem arising from stacked machine learning systems, (2) model-induced scattering versus model-induced herding in market participant behaviour, and (3) flattening versus deepening of data platform power. These tensions directly correspond to core regulatory priorities around model risk management, systemic risk, and competition policy. Using financial audit as a case study, we demonstrate how these tensions interact in practice and propose governance frameworks, including a synthetic data labelling regime to preserve contextual information when datasets cross organizational boundaries. |
Date: | 2025–09–08 |
URL: | https://d.repec.org/n?u=RePEc:osf:socarx:ruxkh_v1 |
By: | Jeff Dominitz; Charles F. Manski |
Abstract: | Enormous attention and resources are being devoted to the quest for artificial general intelligence and, even more ambitiously, artificial superintelligence. We wonder about the implications for our methodological research, which aims to help decision makers cope with what econometricians call identification problems, inferential problems in empirical research that do not diminish as sample size grows. Of particular concern are missing data problems in prediction and treatment choice. Essentially all data collection intended to inform decision making is subject to missing data, which gives rise to identification problems. Thus far, we see no indication that the current dominant architecture of machine learning (ML)-based artificial intelligence (AI) systems will outperform humans in this context. In this paper, we explain why we have reached this conclusion and why we see the missing data problem as a cautionary case study in the quest for superintelligence more generally. We first discuss the concept of intelligence, before presenting a decision-theoretic perspective that formalizes the connection between intelligence and identification problems. We next apply this perspective to two leading cases of missing data problems. Then we explain why we are skeptical that AI research is currently on a path toward machines doing better than humans at solving these identification problems. |
Date: | 2025–09 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2509.12388 |
By: | Adam Nelson-Archer; Aleia Sen; Meena Al Hasani; Sofia Davila; Jessica Le; Omar Abbouchi |
Abstract: | We present a deep learning approach for forecasting short-term employment changes and assessing long-term industry health using labor market data from the U.S. Bureau of Labor Statistics. Our system leverages a Long- and Short-Term Time-series Network (LSTNet) to process multivariate time series data, including employment levels, wages, turnover rates, and job openings. The model outputs both 7-day employment forecasts and an interpretable Industry Employment Health Index (IEHI). Our approach outperforms baseline models across most sectors, particularly in stable industries, and demonstrates strong alignment between IEHI rankings and actual employment volatility. We discuss error patterns, sector-specific performance, and future directions for improving interpretability and generalization. |
Date: | 2025–06 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2507.01979 |
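An LSTNet-flavoured model combines a convolution for short-term patterns, a recurrent unit for longer dependencies, and a linear autoregressive shortcut; the sketch below is a minimal rendition with assumed sizes, not the paper's configuration.

```python
import torch
import torch.nn as nn

class MiniLSTNet(nn.Module):
    def __init__(self, n_series: int, conv_channels: int = 32, hidden: int = 64):
        super().__init__()
        self.conv = nn.Conv1d(n_series, conv_channels, kernel_size=6)
        self.gru = nn.GRU(conv_channels, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_series)
        self.highway = nn.Linear(n_series, n_series)  # linear AR shortcut

    def forward(self, x):                 # x: (batch, time, series)
        c = torch.relu(self.conv(x.transpose(1, 2))).transpose(1, 2)
        _, h = self.gru(c)                # last hidden state summarizes the sequence
        return self.head(h[-1]) + self.highway(x[:, -1, :])

model = MiniLSTNet(n_series=4)            # e.g. employment, wages, turnover, openings
forecast = model(torch.randn(8, 48, 4))   # next-step forecast per series
```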
By: | Michael Monoyios; Olivia Pricilia |
Abstract: | We introduce a novel neural-network-based approach to learning the generating function $G(\cdot)$ of a functionally generated portfolio (FGP) from synthetic or real market data. In the neural network setting, the generating function is represented as $G_{\theta}(\cdot)$, where $\theta$ is an iterable neural network parameter vector, and $G_{\theta}(\cdot)$ is trained to maximise investment return relative to the market portfolio. We compare the performance of the Neural FGP approach against classical FGP benchmarks. FGPs provide a robust alternative to classical portfolio optimisation by bypassing the need to estimate drifts or covariances. The neural FGP framework extends this by introducing flexibility in the design of the generating function, enabling it to learn from market dynamics while preserving self-financing and pathwise decomposition properties. |
Date: | 2025–06 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2506.19715 |
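For a positive generating function G on the simplex, Fernholz's construction gives portfolio weights pi_i = m_i (D_i log G(m) + 1 - sum_j m_j D_j log G(m)); the sketch below computes these weights for a small neural G_theta via autograd, as an illustration rather than the paper's trained model.

```python
import torch
import torch.nn as nn

log_G = nn.Sequential(nn.Linear(5, 16), nn.Tanh(), nn.Linear(16, 1))  # log G_theta

def fgp_weights(m: torch.Tensor) -> torch.Tensor:
    """m: (n_assets,) market weights on the simplex."""
    m = m.clone().requires_grad_(True)
    grad = torch.autograd.grad(log_G(m).sum(), m, create_graph=True)[0]
    # Fernholz's formula; the weights sum to one by construction.
    return m * (grad + 1 - (m * grad).sum())

m = torch.full((5,), 0.2)
pi = fgp_weights(m)
print(pi, pi.sum())   # self-financing FGP weights summing to one
```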
By: | Amarendra Sharma |
Abstract: | This paper introduces a novel Proxy-Enhanced Correlated Random Effects Double Machine Learning (P-CRE-DML) framework to estimate causal effects in panel data with non-linearities and unobserved heterogeneity. Combining Double Machine Learning (DML, Chernozhukov et al., 2018), Correlated Random Effects (CRE, Mundlak, 1978), and lagged variables (Arellano & Bond, 1991), and innovating within the CRE-DML framework (Chernozhukov et al., 2022; Clarke & Polselli, 2025; Fuhr & Papies, 2024), we apply P-CRE-DML to investigate the effect of social trust on GDP growth across 89 countries (2010-2020). We find a positive and statistically significant relationship between social trust and economic growth. This aligns with prior findings on the trust-growth relationship (e.g., Knack & Keefer, 1997). Furthermore, a Monte Carlo simulation demonstrates P-CRE-DML's advantage in terms of lower bias over CRE-DML and System GMM. P-CRE-DML offers a robust and flexible alternative for panel data causal inference, with applications beyond economic growth. |
Date: | 2025–06 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2506.23297 |
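The CRE ingredient is the Mundlak device: within-unit means of time-varying covariates (plus lags, in the spirit of Arellano-Bond) enter the DML nuisance functions as controls; variable names in this sketch are illustrative.

```python
import pandas as pd

panel = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "trust": [0.4, 0.5, 0.6, 0.7],
    "gdp_growth": [2.1, 2.4, 1.8, 2.0],
})

# Correlated random effects: add within-unit covariate means as controls.
panel["trust_mean"] = panel.groupby("country")["trust"].transform("mean")
# Lagged treatment, usable as an additional proxy/control.
panel["trust_lag"] = panel.groupby("country")["trust"].shift(1)
# trust_mean and trust_lag would then join X in the cross-fitted DML step.
```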
By: | Tennant, Elizabeth J.; Michuda, Aleksandr; Upton, Joanna B.; Chamorro, Andres; Engstrom, Ryan; Mann, Michael L.; Newhouse, David; Weber, Michael; Barrett, Christopher B. |
Abstract: | Exposure to extreme weather events and other adverse shocks has led to an increasing number of humanitarian crises in developing countries in recent years. These events cause acute suffering and compromise future welfare by adversely impacting human capital formation among vulnerable populations. Early and accurate detection of adverse shocks to food security, health, and schooling is critical to facilitating timely and well-targeted humanitarian interventions to minimize these detrimental effects. Yet monitoring data are rarely available with the frequency and spatial granularity needed. This paper uses high-frequency household survey data from the Rapid Feedback Monitoring System, collected in 2020–23 in southern Malawi, to explore whether combining monthly data with publicly available remote-sensing features improves the accuracy of machine learning extrapolations across time and space, thereby enhancing monitoring efforts. In the sample, illnesses and schooling disruptions are not reliably predicted. However, when both lagged outcome data and geospatial features are available, intertemporal and spatiotemporal prediction of food insecurity indicators is promising. |
Date: | 2025–09–03 |
URL: | https://d.repec.org/n?u=RePEc:wbk:wbrwps:11202 |
By: | Yuming Ma |
Abstract: | Myopic optimization (MO) outperforms reinforcement learning (RL) in portfolio management: RL yields lower or negative returns, higher variance, larger costs, heavier CVaR, lower profitability, and greater model risk. We model execution and liquidation frictions with mark-to-market accounting. Using Malliavin calculus (Clark-Ocone/BEL), we derive policy gradients and a risk shadow price, unifying the HJB and KKT conditions. This yields dual-gap and convergence results: geometric convergence for MO versus performance floors for RL. We quantify phantom profit in RL via a Malliavin policy-gradient contamination analysis and define a control-affects-dynamics (CAD) premium of RL, which is plausibly positive. |
Date: | 2025–09 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2509.12764 |
By: | Francesco Slataper; Luis Menéndez; Daniel Montolio; Hannes Mueller |
Abstract: | This article exploits data from a political conflict between language groups to show how political events can rapidly redefine how these groups interact on social media. Leveraging a unique dataset of 26 million retweets by 120,000 Catalan- and Spanish-speaking Twitter users, we estimate individual exposure to tweets with a network-based model. We then compare two shocks in the same region and year: the Barcelona terror attack and the Catalan independence referendum of 2017. The referendum, and the related police violence, triggered a sharp, symmetric jump in retweeting across language groups. The terror attack, by contrast, did not lead to a similar realignment. |
Keywords: | echo-chambers, ethno-linguistic conflict, polarization, political conflict, retweet behavior, social media, social networks |
JEL: | D74 C55 C45 |
Date: | 2025–09 |
URL: | https://d.repec.org/n?u=RePEc:bge:wpaper:1505 |