nep-big New Economics Papers
on Big Data
Issue of 2025–11–10
fourteen papers chosen by
Tom Coupé, University of Canterbury


  1. Exact Terminal Condition Neural Network for American Option Pricing Based on the Black-Scholes-Merton Equations By Wenxuan Zhang; Yixiao Guo; Benzhuo Lu
  2. Technical Analysis Meets Machine Learning: Bitcoin Evidence By José Ángel Islas Anguiano; Andrés García-Medina
  3. Direct Debiased Machine Learning via Bregman Divergence Minimization By Masahiro Kato
  4. Causal and Predictive Modeling of Short-Horizon Market Risk and Systematic Alpha Generation Using Hybrid Machine Learning Ensembles By Aryan Ranjan
  5. A Quantitative Approach to Central Bank Haircuts and Counterparty Risk Management By Yuji Sakurai
  6. Europe in the Headlines: What Two Decades of French News Reveal about EU Sentiment By Camille Jehle; Florian Le Gallo
  7. ChatGPT in Systematic Investing: Enhancing Risk-Adjusted Returns with LLMs By Nikolas Anic; Andrea Barbon; Ralf Seiz; Carlo Zarattini
  8. The Prestakes of Stock Market Investing By Francesco Bianchi; Do Q. Lee; Sydney C. Ludvigson; Sai Ma
  9. TABL-ABM: A Hybrid Framework for Synthetic LOB Generation By Ollie Olby; Rory Baggott; Namid Stillman
  10. Environmental Complexity and Respiratory Health: A Data-Driven Exploration Across European Regions By Resta, Onofrio; Resta, Emanuela; Costantiello, Alberto; Liuzzi, Piergiuseppe; Leogrande, Angelo
  11. Monetary Policy Shocks: A New Hope. Large Language Models and Central Bank Communication. By Rubén Fernández-Fuertes
  12. Nearest Neighbor Matching as Least Squares Density Ratio Estimation and Riesz Regression By Masahiro Kato
  13. Prompting for Policy: Forecasting Macroeconomic Scenarios with Synthetic LLM Personas By Giulia Iadisernia; Carolina Camassa
  14. Predicting Household Water Consumption Using Satellite and Street View Images in Two Indian Cities By Qiao Wang; Joseph George

  1. By: Wenxuan Zhang; Yixiao Guo; Benzhuo Lu
    Abstract: This paper proposes the Exact Terminal Condition Neural Network (ETCNN), a deep learning framework for accurately pricing American options by solving the Black-Scholes-Merton (BSM) equations. The ETCNN incorporates carefully designed functions that ensure the numerical solution not only exactly satisfies the terminal condition of the BSM equations but also matches the non-smooth and singular behavior of the option price near expiration. This method effectively addresses the challenges posed by the inequality constraints in the BSM equations and can be easily extended to high-dimensional scenarios. Additionally, input normalization is employed to maintain homogeneity. Multiple experiments demonstrate that the proposed method achieves high accuracy and exhibits robustness across various situations, outperforming both traditional numerical methods and other machine learning approaches.
    Date: 2025–10
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2510.27132
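    One common way to hard-wire a terminal condition into a network is the ansatz u(S, tau) = payoff(S) + tau * NN(S, tau), which satisfies u(S, 0) = payoff(S) exactly. The PyTorch sketch below illustrates this idea for an American put; the ansatz, strike, and architecture are our illustrative assumptions, not necessarily the paper's ETCNN construction.
      import torch
      import torch.nn as nn

      class TerminalExactNet(nn.Module):
          # Ansatz: u(S, tau) = payoff(S) + tau * NN(S, tau), with tau = T - t,
          # so the terminal condition u(S, 0) = payoff(S) holds exactly by design.
          def __init__(self, strike: float = 1.0, hidden: int = 64):
              super().__init__()
              self.strike = strike
              self.net = nn.Sequential(
                  nn.Linear(2, hidden), nn.Tanh(),
                  nn.Linear(hidden, hidden), nn.Tanh(),
                  nn.Linear(hidden, 1),
              )

          def forward(self, S: torch.Tensor, tau: torch.Tensor) -> torch.Tensor:
              payoff = torch.clamp(self.strike - S, min=0.0)  # American put payoff
              x = torch.stack([S, tau], dim=-1)
              return payoff + tau * self.net(x).squeeze(-1)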
  2. By: José Ángel Islas Anguiano; Andrés García-Medina
    Abstract: In this note, we compare Bitcoin trading performance using two machine learning models, Light Gradient Boosting Machine (LightGBM) and Long Short-Term Memory (LSTM), and two strategies based on technical analysis: Exponential Moving Average (EMA) crossover and a combination of Moving Average Convergence/Divergence with the Average Directional Index (MACD+ADX). The objective is to evaluate how trading signals can be used to maximize profits in the Bitcoin market. This comparison was motivated by the U.S. Securities and Exchange Commission's (SEC) approval of the first spot Bitcoin exchange-traded funds (ETFs) on 2024-01-10. Our results show that the LSTM model achieved a cumulative return of approximately 65.23% in under a year, significantly outperforming LightGBM, the EMA and MACD+ADX strategies, and the baseline buy-and-hold. This study highlights the potential for deeper integration of machine learning and technical analysis in the rapidly evolving cryptocurrency landscape.
    Date: 2025–11
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2511.00665
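    For reference, a minimal pandas sketch of the EMA-crossover signal the note benchmarks against; the 12/26-period windows are our illustrative choice, not necessarily the authors'.
      import pandas as pd

      def ema_crossover_signal(close: pd.Series, fast: int = 12, slow: int = 26) -> pd.Series:
          """1 = long while the fast EMA is above the slow EMA, else 0 (flat)."""
          ema_fast = close.ewm(span=fast, adjust=False).mean()
          ema_slow = close.ewm(span=slow, adjust=False).mean()
          return (ema_fast > ema_slow).astype(int)

      # Usage: shift the signal one bar to avoid look-ahead bias.
      # strategy_returns = ema_crossover_signal(close).shift(1) * close.pct_change()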
  3. By: Masahiro Kato
    Abstract: We develop a direct debiased machine learning framework comprising Neyman targeted estimation and generalized Riesz regression. Our framework unifies Riesz regression for automatic debiased machine learning, covariate balancing, targeted maximum likelihood estimation (TMLE), and density-ratio estimation. In many problems involving causal effects or structural models, the parameters of interest depend on regression functions. Plugging regression functions estimated by machine learning methods into the identifying equations can yield poor performance because of first-stage bias. To reduce such bias, debiased machine learning employs Neyman orthogonal estimating equations. Debiased machine learning typically requires estimation of the Riesz representer and the regression function. For this problem, we develop a direct debiased machine learning framework with an end-to-end algorithm. We formulate estimation of the nuisance parameters, the regression function and the Riesz representer, as minimizing the discrepancy between Neyman orthogonal scores computed with known and unknown nuisance parameters, which we refer to as Neyman targeted estimation. Neyman targeted estimation includes Riesz representer estimation, and we measure discrepancies using the Bregman divergence. The Bregman divergence encompasses various loss functions as special cases, where the squared loss yields Riesz regression and the Kullback-Leibler divergence yields entropy balancing. We refer to this Riesz representer estimation as generalized Riesz regression. Neyman targeted estimation also yields TMLE as a special case for regression function estimation. Furthermore, for specific pairs of models and Riesz representer estimation methods, we can automatically obtain the covariate balancing property without explicitly solving the covariate balancing objective.
    Date: 2025–10
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2510.23534
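    To fix ideas, a standard statement of the Bregman divergence and the two special cases the abstract mentions, in our notation rather than the paper's:
      % Bregman divergence generated by a differentiable convex function \phi:
      D_\phi(a, b) = \phi(a) - \phi(b) - \phi'(b)\,(a - b)
      % \phi(x) = x^2 gives the squared loss D_\phi(a, b) = (a - b)^2
      %   (recovering Riesz regression);
      % \phi(x) = x \log x - x gives the KL-type divergence
      %   D_\phi(a, b) = a \log(a/b) - a + b (recovering entropy balancing).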
  4. By: Aryan Ranjan
    Abstract: We present a systematic trading framework that forecasts short-horizon market risk, identifies its underlying drivers, and generates alpha using a hybrid machine learning ensemble built to trade on the resulting signal. The framework integrates neural networks with tree-based voting models to predict five-day drawdowns in the S&P 500 ETF, leveraging a cross-asset feature set spanning equities, fixed income, foreign exchange, commodities, and volatility markets. Interpretable feature attribution methods reveal the key macroeconomic and microstructural factors that differentiate high-risk (crash) from benign (non-crash) weekly regimes. Empirical results show a Sharpe ratio of 2.51 and an annualized CAPM alpha of +0.28, with a market beta of 0.51, indicating that the model delivers substantial systematic alpha with limited directional exposure during the 2005–2025 backtest period. Overall, the findings underscore the effectiveness of hybrid ensemble architectures in capturing nonlinear risk dynamics and identifying interpretable, potentially causal drivers, providing a robust blueprint for machine learning-driven alpha generation in systematic trading.
    Date: 2025–10
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2510.22348
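    A minimal scikit-learn sketch of a soft-voting ensemble mixing a neural network with tree-based models for a crash/non-crash label; the specific estimators, settings, and drawdown threshold are our assumptions, not the paper's.
      from sklearn.ensemble import (GradientBoostingClassifier,
                                    RandomForestClassifier, VotingClassifier)
      from sklearn.neural_network import MLPClassifier
      from sklearn.pipeline import make_pipeline
      from sklearn.preprocessing import StandardScaler

      # X: cross-asset features; y: 1 if the next five-day drawdown breaches a threshold
      ensemble = VotingClassifier(
          estimators=[
              ("mlp", make_pipeline(StandardScaler(),
                                    MLPClassifier(hidden_layer_sizes=(64, 32),
                                                  max_iter=500))),
              ("gbt", GradientBoostingClassifier()),
              ("rf", RandomForestClassifier(n_estimators=300)),
          ],
          voting="soft",  # average the models' predicted probabilities
      )
      # ensemble.fit(X_train, y_train)
      # p_crash = ensemble.predict_proba(X_test)[:, 1]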
  5. By: Yuji Sakurai
    Abstract: This paper presents a comprehensive framework for determining haircuts on collateral used in central bank operations, quantifying residual uncollateralized exposures, and validating haircut models using machine learning. First, it introduces four haircut model types tailored to asset characteristics (marketable or non-marketable) and data availability. It proposes a novel model for setting haircuts in data-limited environments using a satellite cross-country model. Key principles guiding haircut calibration include non-procyclicality, data-drivenness, conservatism, and the avoidance of arbitrage gaps. The paper details model inputs such as Value-at-Risk (VaR) percentiles, volatility measures, and time to liquidation. Second, it proposes a quantitative framework for estimating expected uncollateralized exposures that remain after haircut application, emphasizing their importance in stress scenarios. Illustrative simulations using dynamic Nelson-Siegel yield curve models demonstrate how volatility impacts exposure. Third, the paper explores the use of Variational Autoencoders (VAEs) to simulate stress scenarios for bond yields. Trained on U.S. Treasury data, VAEs capture realistic yield curve distributions, offering an alternative tool for validating VaR-based haircuts. Although interpretability and explainability remain concerns, machine learning models enhance risk assessment by uncovering potential model vulnerabilities.
    Keywords: Haircuts; Uncollateralized Exposure; Machine Learning
    Date: 2025–10–31
    URL: https://d.repec.org/n?u=RePEc:imf:imfwpa:2025/225
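    As a toy illustration of a VaR-based haircut of the kind described, bootstrap returns over the assumed time to liquidation and take the loss quantile; this is our simplification, not the paper's calibrated models.
      import numpy as np

      def var_haircut(daily_returns: np.ndarray, time_to_liquidation: int = 10,
                      percentile: float = 0.99, n_sims: int = 100_000,
                      seed: int = 0) -> float:
          """Haircut = collateral loss over the liquidation horizon that is
          exceeded with probability 1 - percentile (i.i.d. bootstrap draws)."""
          rng = np.random.default_rng(seed)
          draws = rng.choice(daily_returns, size=(n_sims, time_to_liquidation))
          horizon_returns = np.prod(1.0 + draws, axis=1) - 1.0
          return max(-np.quantile(horizon_returns, 1.0 - percentile), 0.0)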
  6. By: Camille Jehle; Florian Le Gallo
    Abstract: Using a large language model, we build a unique corpus of 400,000 articles related to the European Union (EU) published between 2005 and 2023 in more than 100 French local and national newspapers. Drawing on this dataset, we show that the interest of French newspapers in European issues has remained stable since 2005 and is primarily driven by the European elections held every five years. An analysis of polarity and topics covered reveals that the local press pays greater attention to tangible EU initiatives, such as cultural exchange programs, which are generally portrayed in a positive light. Finally, we show that French media sentiment towards the European Union deteriorated significantly following the financial and sovereign debt crises, mirroring the trend observed in Eurobarometer opinion surveys on EU sentiment. From 2013 onward, however, a divergence emerged: press sentiment gradually returned to pre-crisis levels, while the public image of the European Union in opinion surveys remained below them. Focusing on the euro area, we do not observe such a divergence.
    Keywords: Sentiment Indicator, European Sentiment, Press Text Mining
    JEL: C55 F59
    Date: 2025
    URL: https://d.repec.org/n?u=RePEc:bfr:banfra:1008
  7. By: Nikolas Anic; Andrea Barbon; Ralf Seiz; Carlo Zarattini
    Abstract: This paper investigates whether large language models (LLMs) can improve cross-sectional momentum strategies by extracting predictive signals from firm-specific news. We combine daily U.S. equity returns for S&P 500 constituents with high-frequency news data and use prompt-engineered queries to ChatGPT that inform the model when a stock is about to enter a momentum portfolio. The LLM evaluates whether recent news supports a continuation of past returns, producing scores that condition both stock selection and portfolio weights. An LLM-enhanced momentum strategy outperforms a standard long-only momentum benchmark, delivering higher Sharpe and Sortino ratios both in-sample and in a truly out-of-sample period after the model's pre-training cut-off. These gains are robust to transaction costs, prompt design, and portfolio constraints, and are strongest for concentrated, high-conviction portfolios. The results suggest that LLMs can serve as effective real-time interpreters of financial news, adding incremental value to established factor-based investment strategies.
    Date: 2025–10
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2510.26228
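    A hedged sketch of the kind of news-conditioning query described, using the OpenAI Python client; the prompt wording, model name, and 1-10 scale are our illustrative assumptions, not the authors' prompt.
      from openai import OpenAI

      client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

      def news_momentum_score(ticker: str, headlines: list[str]) -> int:
          """Ask the LLM whether recent news supports continuation of past returns."""
          prompt = (
              f"Stock {ticker} is about to enter a momentum portfolio based on its "
              "past returns. Recent headlines:\n- " + "\n- ".join(headlines) +
              "\nOn a scale from 1 (news contradicts continuation) to 10 (news "
              "strongly supports continuation), reply with a single integer."
          )
          resp = client.chat.completions.create(
              model="gpt-4o",
              messages=[{"role": "user", "content": prompt}],
          )
          return int(resp.choices[0].message.content.strip())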
  8. By: Francesco Bianchi; Do Q. Lee; Sydney C. Ludvigson; Sai Ma
    Abstract: How rational is the stock market and how efficiently does it process information? We use machine learning to establish a practical measure of rational and efficient expectation formation while identifying distortions and inefficiencies in the subjective beliefs of market participants. The algorithm independently learns, stays attentive to fundamentals, credit risk, and sentiment, and makes abrupt course-corrections at critical junctures. By contrast, the subjective beliefs of investors, professionals, and equity analysts do little of this and instead contain predictable mistakes ("prestakes") that are especially prevalent in times of market turbulence. Trading schemes that bet against prestakes deliver defensive strategies with large CAPM and Fama-French 5-factor alphas.
    JEL: G1 G17 G40 G41
    Date: 2025–10
    URL: https://d.repec.org/n?u=RePEc:nbr:nberwo:34420
  9. By: Ollie Olby; Rory Baggott; Namid Stillman
    Abstract: The recent application of deep learning models to financial trading has heightened the need for high-fidelity financial time series data. Such synthetic data can be used to supplement historical data when training large trading models. State-of-the-art generative models often rely on huge amounts of historical data and large, complicated architectures, ranging from autoregressive and diffusion-based models to architecturally simpler ones such as the temporal-attention bilinear layer (TABL). Agent-based approaches to modelling limit order book dynamics can also recreate trading activity through mechanistic models of trader behaviours. In this work, we demonstrate how a popular agent-based framework for simulating intraday trading activity, the Chiarella model, can be combined with one of the most performant deep learning models for forecasting multivariate time series, the TABL model. This forecasting model is coupled to a simulation of a matching engine with a novel method for simulating deleted order flow. Our simulator allows us to test the generative abilities of the forecasting model against stylised facts. Our results show that this methodology generates realistic price dynamics; however, on closer analysis, parts of the market's microstructure are not accurately recreated, highlighting the need to include more sophisticated agent behaviours in the modelling framework to account for tail events.
    Date: 2025–10
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2510.22685
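    For orientation, a toy discrete-time Chiarella-style price process: fundamentalists pull the price toward fundamental value, chartists chase a smoothed trend, and noise traders perturb it. The parameters and exact functional form are our illustrative choices, not the paper's calibration.
      import numpy as np

      def chiarella_prices(n_steps=1000, p_fund=0.0, g_f=0.05, g_c=0.04,
                           alpha=0.2, gamma=10.0, sigma=0.01, seed=0):
          """Toy Chiarella-type dynamics for a log price p."""
          rng = np.random.default_rng(seed)
          p, trend = 0.0, 0.0
          path = np.empty(n_steps)
          for t in range(n_steps):
              # fundamentalist pull + bounded chartist trend-chasing
              demand = g_f * (p_fund - p) + g_c * np.tanh(gamma * trend)
              dp = demand + sigma * rng.standard_normal()
              trend += alpha * (dp - trend)   # exponentially smoothed price trend
              p += dp
              path[t] = p
          return path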
  10. By: Resta, Onofrio; Resta, Emanuela; Costantiello, Alberto; Liuzzi, Piergiuseppe; Leogrande, Angelo
    Abstract: This paper examines the environmental and infrastructure determinants of respiratory disease mortality (TRD) across European nation-states through an original combination of econometric, machine learning, clustering, and network-based approaches. The primary scientific inquiry is how structural environmental variables, such as land use, energy mix, sanitation, and climatic stress, interact to affect respiratory mortality across regions. Although prior literature has addressed individual environmental predictors in isolation, this paper fills an important gap by using a multi-method, systems-level analysis that accounts for interdependencies as well as contextual variability. The statistical analysis draws on panel data covering several years and nation-states, using fixed effects regressions with robust standard errors to evaluate the effects of variables such as agricultural land use (AGRL), access to electricity (ELEC), renewable energy (RENE), freshwater withdrawals (WTRW), cooling degree days (CDD), and sanitation (SANS). We employ cluster analysis and density-based methods to identify spatial and environmental groupings, while machine learning regressions, specifically K-Nearest Neighbors (KNN), are used for predictive modeling and evaluating feature importance. Lastly, network analysis identifies the structural connections between variables, including influence metrics and directional weights. We obtain the following results. Across all regressions, AGRL, WTRW, and SANS emerge as important determinants of TRD, and the network influence metrics consistently identify the same three variables as key influencers. The best-performing predictive regression points to nonlinear (polynomial or non-monotone), context-sensitive effects. The strong connections between variables found in the network analysis underscore the importance of holistic approaches to environmental health. By combining these disparate yet complementary methodological tools, the paper provides robust, interpretable, and policy-relevant insights into the environmental complexity driving respiratory health outcomes across Europe.
    Keywords: Respiratory Disease Mortality, Environmental Determinants, Machine Learning Regression, Network Analysis, Panel Data Models
    JEL: C23 C38 C45 I0 I00 I1 I10
    Date: 2025–09–01
    URL: https://d.repec.org/n?u=RePEc:pra:mprapa:126073
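    Since KNN regressors have no built-in feature importances, permutation importance is one standard way to obtain the importance rankings the abstract refers to; a minimal scikit-learn sketch under that assumption follows.
      from sklearn.inspection import permutation_importance
      from sklearn.neighbors import KNeighborsRegressor
      from sklearn.pipeline import make_pipeline
      from sklearn.preprocessing import StandardScaler

      # X: environmental covariates (AGRL, ELEC, RENE, WTRW, CDD, SANS, ...); y: TRD
      model = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))
      # model.fit(X_train, y_train)
      # result = permutation_importance(model, X_test, y_test, n_repeats=30,
      #                                 random_state=0)
      # result.importances_mean ranks the covariates by predictive contribution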
  11. By: Rubén Fernández-Fuertes
    Abstract: I develop a multi-agent LLM framework that processes Federal Reserve communications to construct narrative monetary policy surprises. By analyzing Beige Books and Minutes released before each FOMC meeting, the system generates conditional expectations that yield less noisy surprises than market-based measures. These surprises produce theoretically consistent impulse responses where contractionary shocks generate persistent disinflationary effects and enable profitable yield curve trading strategies that outperform alternatives. By directly extracting expectations rather than cleaning surprises ex post, this approach demonstrates how multi-agent LLMs can implement narrative identification at scale without contamination in high-frequency measures.
    Keywords: Monetary Policy Shocks, Central Bank Communication, Large Language Models, FOMC, Federal Reserve, Natural Language Processing, High-Frequency Identification, Term Structure
    JEL: E52 E58 E43 G14 C45 C55
    Date: 2025
    URL: https://d.repec.org/n?u=RePEc:baf:cbafwp:cbafwp25257
  12. By: Masahiro Kato
    Abstract: This study proves that Nearest Neighbor (NN) matching can be interpreted as an instance of Riesz regression for automatic debiased machine learning. Lin et al. (2023) show that NN matching is an instance of density-ratio estimation with their new density-ratio estimator. Chernozhukov et al. (2024) develop Riesz regression for automatic debiased machine learning, which directly estimates the Riesz representer (or, equivalently, the bias-correction term) by minimizing the mean squared error. In this study, we first prove that the density-ratio estimation method proposed in Lin et al. (2023) is essentially equivalent to Least-Squares Importance Fitting (LSIF), proposed in Kanamori et al. (2009) for direct density-ratio estimation. Furthermore, we derive Riesz regression using the LSIF framework. Based on these results, we derive NN matching from Riesz regression. This study is based on our work Kato (2025a) and Kato (2025b).
    Date: 2025–10
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2510.24433
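    The LSIF objective at the heart of the equivalence, in standard notation (ours, not the paper's): for the density ratio r_0(x) = p_nu(x) / p_de(x),
      \min_r \; \tfrac{1}{2}\, \mathbb{E}_{de}\!\left[ r(X)^2 \right]
             - \mathbb{E}_{nu}\!\left[ r(X) \right]
      % Up to a constant this equals \tfrac{1}{2}\,\mathbb{E}_{de}[(r(X) - r_0(X))^2],
      % since \mathbb{E}_{de}[r(X)\, r_0(X)] = \mathbb{E}_{nu}[r(X)].
      % Riesz regression has the same quadratic shape with \mathbb{E}_{nu}[r(X)]
      % replaced by \mathbb{E}[m(W; r)], which is how the two problems map into
      % each other.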
  13. By: Giulia Iadisernia; Carolina Camassa
    Abstract: We evaluate whether persona-based prompting improves Large Language Model (LLM) performance on macroeconomic forecasting tasks. Using 2,368 economics-related personas from the PersonaHub corpus, we prompt GPT-4o to replicate the ECB Survey of Professional Forecasters across 50 quarterly rounds (2013-2025). We compare the persona-prompted forecasts against the human expert panel, across four target variables (HICP, core HICP, GDP growth, unemployment) and four forecast horizons. We also compare the results against 100 baseline forecasts without persona descriptions to isolate their effect. We report two main findings. First, GPT-4o and human forecasters achieve remarkably similar accuracy levels, with differences that are statistically significant yet practically modest. Our out-of-sample evaluation on 2024-2025 data demonstrates that GPT-4o can maintain competitive forecasting performance on unseen events, though with notable differences compared to the in-sample period. Second, our ablation experiment reveals no measurable forecasting advantage from persona descriptions, suggesting these prompt components can be omitted to reduce computational costs without sacrificing accuracy. Our results provide evidence that GPT-4o can achieve competitive forecasting accuracy even on out-of-sample macroeconomic events, if provided with relevant context data, while revealing that diverse prompts produce remarkably homogeneous forecasts compared to human panels.
    Date: 2025–11
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2511.02458
  14. By: Qiao Wang; Joseph George
    Abstract: Monitoring household water use in rapidly urbanizing regions is hampered by costly, time-intensive enumeration methods and surveys. We investigate whether publicly available imagery (satellite tiles and Google Street View (GSV) segmentation) and simple geospatial covariates (nightlight intensity, population density) can be used to predict household water consumption in Hubballi-Dharwad, India. We compare four approaches: survey features (benchmark), CNN embeddings (satellite, GSV, combined), and GSV semantic maps with auxiliary data. Under an ordinal classification framework, GSV segmentation plus remote-sensing covariates achieves 0.55 accuracy for water use, approaching survey-based models (0.59 accuracy). Error analysis shows high precision at the extremes of the household water consumption distribution, but confusion among the middle classes owing to overlapping visual proxies. We also compare and contrast our estimates of household water consumption with those of subjective household income. Our findings demonstrate that open-access imagery, coupled with minimal geospatial data, offers a promising alternative to surveys for obtaining reliable household water consumption estimates in urban analytics.
    Date: 2025–10
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2510.26957
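    The abstract does not spell out its ordinal classifier; one standard construction, used here purely as an illustration, is the cumulative binary scheme of Frank & Hall (2001), where the k-th model predicts P(y > k).
      import numpy as np
      from sklearn.linear_model import LogisticRegression

      class CumulativeOrdinal:
          """Ordinal classification via K-1 cumulative binary classifiers."""
          def __init__(self, n_classes: int):
              self.n_classes = n_classes
              self.models = [LogisticRegression(max_iter=1000)
                             for _ in range(n_classes - 1)]

          def fit(self, X, y):
              for k, m in enumerate(self.models):
                  m.fit(X, (y > k).astype(int))  # binary target: class above threshold k
              return self

          def predict(self, X):
              # Expected class = sum over thresholds of P(y > k); round to a label.
              p_gt = np.column_stack([m.predict_proba(X)[:, 1] for m in self.models])
              return np.rint(p_gt.sum(axis=1)).astype(int)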

This nep-big issue is ©2025 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at https://nep.repec.org. For comments please write to the director of NEP, Marco Novarese at <director@nep.repec.org>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.