nep-big New Economics Papers
on Big Data
Issue of 2026–01–12
sixteen papers chosen by
Tom Coupé, University of Canterbury


  1. Risk-Aware Financial Forecasting Enhanced by Machine Learning and Intuitionistic Fuzzy Multi-Criteria Decision-Making By Safiye Turgay; Serkan Erdoğan; Željko Stević; Orhan Emre Elma; Tevfik Eren; Zhiyuan Wang; Mahmut Baydaş
  2. An Efficient Machine Learning Framework for Option Pricing via Fourier Transform By Liying Zhang; Ying Gao
  3. Ultimate Forward Rate Prediction and its Application to Bond Yield Forecasting: A Machine Learning Perspective By Jiawei Du; Yi Hong
  4. Hybrid Quantum-Classical Ensemble Learning for S&P 500 Directional Prediction By Abraham Itzhak Weinberg
  5. Uncertainty-Adjusted Sorting for Asset Pricing with Machine Learning By Yan Liu; Ye Luo; Zigan Wang; Xiaowei Zhang
  6. xtdml: Double Machine Learning Estimation to Static Panel Data Models with Fixed Effects in R By Annalivia Polselli
  7. A Test of Lookahead Bias in LLM Forecasts By Zhenyu Gao; Wenxi Jiang; Yutong Yan
  8. Cattle Prices Under Arid Conditions: Hedonic and Neural Network Approach By Calil, Yuri Clements Daglia
  9. FedSight AI: Multi-Agent System Architecture for Federal Funds Target Rate Prediction By Yuhan Hou; Tianji Rao; Jeremy Tan; Adler Viton; Xiyue Zhang; David Ye; Abhishek Kodi; Sanjana Dulam; Aditya Paul; Yikai Feng
  10. Generative AI-enhanced Sector-based Investment Portfolio Construction By Alina Voronina; Oleksandr Romanko; Ruiwen Cao; Roy H. Kwon; Rafael Mendoza-Arriaga
  11. Generative Agents and Expectations: Do LLMs Align with Heterogeneous Agent Models? By Filippo Gusella; Eugenio Vicario
  12. Difference-in-Differences using Double Negative Controls and Graph Neural Networks for Unmeasured Network Confounding By Zihan Zhang; Lianyan Fu; Dehui Wang
  13. Advances in Agentic AI: Back to the Future By Sergio Alvarez-Telena; Marta Diez-Fernandez
  14. Fairness-Aware Insurance Pricing: A Multi-Objective Optimization Approach By Tim J. Boonen; Xinyue Fan; Zixiao Quan
  15. Deep Learning for Art Market Valuation By Jianping Mei; Michael Moses; Jan Waelty; Yucheng Yang
  16. Quantitative Financial Modeling for Sri Lankan Markets: Approach Combining NLP, Clustering and Time-Series Forecasting By Linuk Perera

  1. By: Safiye Turgay; Serkan Erdoğan; Željko Stević; Orhan Emre Elma; Tevfik Eren; Zhiyuan Wang; Mahmut Baydaş
    Abstract: In the face of increasing financial uncertainty and market complexity, this study presents a novel risk-aware financial forecasting framework that integrates advanced machine learning techniques with intuitionistic fuzzy multi-criteria decision-making (MCDM). Tailored to the BIST 100 index and validated through a case study of a major defense company in Türkiye, the framework fuses structured financial data, unstructured text data, and macroeconomic indicators to enhance predictive accuracy and robustness. It incorporates a hybrid suite of models, including extreme gradient boosting (XGBoost), a long short-term memory (LSTM) network, and a graph neural network (GNN), to deliver probabilistic forecasts with quantified uncertainty. The empirical results demonstrate high forecasting accuracy, with a net profit mean absolute percentage error (MAPE) of 3.03% and narrow 95% confidence intervals for key financial indicators. The risk-aware analysis indicates a favorable risk-return profile, with a Sharpe ratio of 1.25 and a higher Sortino ratio of 1.80, suggesting relatively low downside volatility and robust performance under market fluctuations. Sensitivity analysis shows that the key financial indicator predictions are highly sensitive to variations in inflation, interest rates, sentiment, and exchange rates. Additionally, using an intuitionistic fuzzy MCDM approach, combining entropy weighting, evaluation based on distance from the average solution (EDAS), and the measurement of alternatives and ranking according to compromise solution (MARCOS) methods, the tabular data learning network (TabNet) outperforms the other models and is identified as the most suitable candidate for deployment. Overall, the findings of this work highlight the importance of integrating advanced machine learning, risk quantification, and fuzzy MCDM methodologies in financial forecasting, particularly in emerging markets.
    Date: 2025–12
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2512.17936
  2. By: Liying Zhang; Ying Gao
    Abstract: The increasing need for rapid recalibration of option pricing models in dynamic markets places stringent computational demands on data generation and valuation algorithms. In this work, we propose a hybrid algorithmic framework that integrates the smooth offset algorithm (SOA) with supervised machine learning models for the fast pricing of multiple path-independent options under exponential Lévy dynamics. Building upon the SOA-generated dataset, we train neural networks, random forests, and gradient boosted decision trees to construct surrogate pricing operators. Extensive numerical experiments demonstrate that, once trained, these surrogates achieve order-of-magnitude acceleration over direct SOA evaluation. Importantly, the proposed framework overcomes key numerical limitations inherent to fast Fourier transform-based methods, including the consistency of input data and the instability in deep out-of-the-money option pricing.
    Date: 2025–12
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2512.16115
  3. By: Jiawei Du; Yi Hong
    Abstract: This study focuses on forecasting the ultimate forward rate (UFR) and developing a UFR-based bond yield prediction model using data from Chinese treasury bonds and macroeconomic variables spanning from December 2009 to December 2024. The de Kort-Vellekoop-type methodology is applied to estimate the UFR, incorporating the optimal turning parameter determination technique proposed in this study, which helps mitigate anomalous fluctuations. In addition, both linear and nonlinear machine learning techniques are employed to forecast the UFR and ultra-long-term bond yields. The results indicate that nonlinear machine learning models outperform their linear counterparts in forecasting accuracy. Incorporating macroeconomic variables, particularly price index-related variables, significantly improves the accuracy of predictions. Finally, a novel UFR-based bond yield forecasting model is developed, demonstrating superior performance across different bond maturities.
    Date: 2025–12
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2601.00011
  4. By: Abraham Itzhak Weinberg
    Abstract: Financial market prediction is a challenging application of machine learning, where even small improvements in directional accuracy can yield substantial value. Most models struggle to exceed 55-57% accuracy due to high noise, non-stationarity, and market efficiency. We introduce a hybrid ensemble framework combining quantum sentiment analysis, Decision Transformer architecture, and strategic model selection, achieving 60.14% directional accuracy on S&P 500 prediction, a 3.10% improvement over individual models. Our framework addresses three limitations of prior approaches. First, architecture diversity dominates dataset diversity: combining different learning algorithms (LSTM, Decision Transformer, XGBoost, Random Forest, Logistic Regression) on the same data outperforms training identical architectures on multiple datasets (60.14% vs. 52.80%), confirmed by correlation analysis ($r>0.6$ among same-architecture models). Second, a 4-qubit variational quantum circuit enhances sentiment analysis, providing +0.8% to +1.5% gains per model. Third, smart filtering excludes weak predictors (accuracy $
    Date: 2025–12
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2512.15738
  5. By: Yan Liu; Ye Luo; Zigan Wang; Xiaowei Zhang
    Abstract: Machine learning is central to empirical asset pricing, but portfolio construction still relies on point predictions and largely ignores asset-specific estimation uncertainty. We propose a simple change: sort assets using uncertainty-adjusted prediction bounds instead of point predictions alone. Across a broad set of ML models and a U.S. equity panel, this approach improves portfolio performance relative to point-prediction sorting. These gains persist even when bounds are built from partial or misspecified uncertainty information. They arise mainly from reduced volatility and are strongest for flexible machine learning models. Identification and robustness exercises show that these improvements are driven by asset-level rather than time or aggregate predictive uncertainty.
    Date: 2026–01
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2601.00593
  6. By: Annalivia Polselli
    Abstract: The double machine learning (DML) method combines the predictive power of machine learning with statistical estimation to conduct inference about the structural parameter of interest. This paper presents the R package `xtdml`, which implements DML methods for partially linear panel regression models with low-dimensional fixed effects and high-dimensional confounding variables, proposed by Clarke and Polselli (2025). The package provides functionalities to: (a) learn nuisance functions with machine learning algorithms from the `mlr3` ecosystem, (b) handle unobserved individual heterogeneity by choosing among first-difference transformation, within-group transformation, and correlated random effects, and (c) transform the covariates with min-max normalization and polynomial expansion to improve learning performance. We showcase the use of `xtdml` with both simulated and real longitudinal data.
    Date: 2025–12
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2512.15965
  7. By: Zhenyu Gao; Wenxi Jiang; Yutong Yan
    Abstract: We develop a statistical test to detect lookahead bias in economic forecasts generated by large language models (LLMs). Using state-of-the-art pre-training data detection techniques, we estimate the likelihood that a given prompt appeared in an LLM's training corpus, a statistic we term Lookahead Propensity (LAP). We formally show that a positive correlation between LAP and forecast accuracy indicates the presence and magnitude of lookahead bias, and apply the test to two forecasting tasks: news headlines predicting stock returns and earnings call transcripts predicting capital expenditures. Our test provides a cost-efficient, diagnostic tool for assessing the validity and reliability of LLM-generated forecasts.
    Date: 2025–12
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2512.23847
  8. By: Calil, Yuri Clements Daglia
    Keywords: Livestock Production/Industries, Marketing, Agricultural Finance
    Date: 2024
    URL: https://d.repec.org/n?u=RePEc:ags:aaea24:343928
  9. By: Yuhan Hou; Tianji Rao; Jeremy Tan; Adler Viton; Xiyue Zhang; David Ye; Abhishek Kodi; Sanjana Dulam; Aditya Paul; Yikai Feng
    Abstract: The Federal Open Market Committee (FOMC) sets the federal funds rate, shaping monetary policy and the broader economy. We introduce FedSight AI, a multi-agent framework that uses large language models (LLMs) to simulate FOMC deliberations and predict policy outcomes. Member agents analyze structured indicators and unstructured inputs such as the Beige Book, debate options, and vote, replicating committee reasoning. A Chain-of-Draft (CoD) extension further improves efficiency and accuracy by enforcing concise multistage reasoning. Evaluated at 2023-2024 meetings, FedSight CoD achieved accuracy of 93.75% and stability of 93.33%, outperforming baselines including MiniFed and Ordinal Random Forest (RF), while offering transparent reasoning aligned with real FOMC communications.
    Date: 2025–12
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2512.15728
  10. By: Alina Voronina; Oleksandr Romanko; Ruiwen Cao; Roy H. Kwon; Rafael Mendoza-Arriaga
    Abstract: This paper investigates how Large Language Models (LLMs) from leading providers (OpenAI, Google, Anthropic, DeepSeek, and xAI) can be applied to quantitative sector-based portfolio construction. We use LLMs to identify investable universes of stocks within S&P 500 sector indices and evaluate how their selections perform when combined with classical portfolio optimization methods. Each model was prompted to select and weight 20 stocks per sector, and the resulting portfolios were compared with their respective sector indices across two distinct out-of-sample periods: a stable market phase (January-March 2025) and a volatile phase (April-June 2025). Our results reveal a strong temporal dependence in LLM portfolio performance. During stable market conditions, LLM-weighted portfolios frequently outperformed sector indices on both cumulative return and risk-adjusted (Sharpe ratio) measures. However, during the volatile period, many LLM portfolios underperformed, suggesting that current models may struggle to adapt to regime shifts or high-volatility environments underrepresented in their training data. Importantly, when LLM-based stock selection is combined with traditional optimization techniques, portfolio outcomes improve in both performance and consistency. This study contributes one of the first multi-model, cross-provider evaluations of generative AI algorithms in investment management. It highlights that while LLMs can effectively complement quantitative finance by enhancing stock selection and interpretability, their reliability remains market-dependent. The findings underscore the potential of hybrid AI-quantitative frameworks, integrating LLM reasoning with established optimization techniques, to produce more robust and adaptive investment strategies.
    Date: 2025–12
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2512.24526
  11. By: Filippo Gusella; Eugenio Vicario
    Abstract: Results in the Heterogeneous Agent Model (HAM) literature determine the proportion of fundamentalists and trend followers in the financial market. This proportion varies according to the periods analyzed. In this paper, we use a large language model (LLM) to construct a generative agent (GA) that determines the probability of adopting one of the two strategies based on current information. The probabilities of strategy adoption are compared with those in the HAM literature for the S&P 500 index between 1990 and 2020. Our findings suggest that the resulting artificial intelligence (AI) expectations align with those reported in the HAM literature. At the same time, extending the analysis to artificial market data helps us to filter the decision-making process of the AI agent. In the artificial market, results confirm the heterogeneity in expectations but reveal systematic asymmetry toward the fundamentalist behavior.
    Keywords: Heterogeneous Expectations, Large Language Model, Generative Agent, Fundamentalists, Trend Followers
    JEL: E30 E70 D84
    Date: 2025
    URL: https://d.repec.org/n?u=RePEc:frz:wpaper:wp2025_18.rdf
  12. By: Zihan Zhang; Lianyan Fu; Dehui Wang
    Abstract: Estimating causal effects from observational network data faces dual challenges of network interference and unmeasured confounding. To address this, we propose a general Difference-in-Differences framework that integrates double negative controls (DNC) and graph neural networks (GNNs). Based on the modified parallel trends assumption and DNC, semiparametric identification of direct and indirect causal effects is established. We then propose doubly robust estimators. Specifically, an approach combining GNNs with the generalized method of moments is developed to estimate the functions of high-dimensional covariates and network structure. Furthermore, we derive the estimator's asymptotic normality under the $\psi$-network dependence and approximate neighborhood interference. Simulations show the finite-sample performance of our estimators. Finally, we apply our method to analyze the impact of China's green credit policy on corporate green innovation.
    Date: 2026–01
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2601.00603
  13. By: Sergio Alvarez-Telena; Marta Diez-Fernandez
    Abstract: In light of the recent convergence between Agentic AI and our field of Algorithmization, this paper seeks to restore conceptual clarity and provide a structured analytical framework for an increasingly fragmented discourse. First, (a) it examines the contemporary landscape and proposes precise definitions for the key notions involved, ranging from intelligence to Agentic AI. Second, (b) it reviews our prior body of work to contextualize the evolution of methodologies and technological advances developed over the past decade, highlighting their interdependencies and cumulative trajectory. Third, (c) by distinguishing Machine and Learning efforts within the field of Machine Learning, (d) it introduces the first Machine in Machine Learning (M1) as the underlying platform enabling today's LLM-based Agentic AI, conceptualized as an extension of B2C information-retrieval user experiences now being repurposed for B2B transformation. Building on this distinction, (e) the white paper develops the notion of the second Machine in Machine Learning (M2) as the architectural prerequisite for holistic, production-grade B2B transformation, characterizing it as Strategies-based Agentic AI and grounding its definition in the structural barriers-to-entry that such systems must overcome to be operationally viable. Further, (f) it offers conceptual and technical insight into what appears to be the first fully realized implementation of an M2. Finally, drawing on the demonstrated accuracy of the two previous decades of professional and academic experience in developing the foundational architectures of Algorithmization, (g) it outlines a forward-looking research and transformation agenda for the coming two decades.
    Date: 2025–12
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2512.24856
  14. By: Tim J. Boonen; Xinyue Fan; Zixiao Quan
    Abstract: Machine learning improves predictive accuracy in insurance pricing but exacerbates trade-offs between competing fairness criteria across different discrimination measures, challenging regulators and insurers to reconcile profitability with equitable outcomes. While existing fairness-aware models offer partial solutions under GLM and XGBoost estimation methods, they remain constrained by single-objective optimization, failing to holistically navigate a conflicting landscape of accuracy, group fairness, individual fairness, and counterfactual fairness. To address this, we propose a novel multi-objective optimization framework that jointly optimizes all four criteria via the Non-dominated Sorting Genetic Algorithm II (NSGA-II), generating a diverse Pareto front of trade-off solutions. We use a specific selection mechanism to extract a premium on this front. Our results show that XGBoost outperforms GLM in accuracy but amplifies fairness disparities; the Orthogonal model excels in group fairness, while Synthetic Control leads in individual and counterfactual fairness. Our method consistently achieves a balanced compromise, outperforming single-model approaches.
    Date: 2025–12
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2512.24747
  15. By: Jianping Mei; Michael Moses; Jan Waelty; Yucheng Yang
    Abstract: We study how deep learning can improve valuation in the art market by incorporating the visual content of artworks into predictive models. Using a large repeated-sales dataset from major auction houses, we benchmark classical hedonic regressions and tree-based methods against modern deep architectures, including multi-modal models that fuse tabular and image data. We find that while artist identity and prior transaction history dominate overall predictive power, visual embeddings provide a distinct and economically meaningful contribution for fresh-to-market works where historical anchors are absent. Interpretability analyses using Grad-CAM and embedding visualizations show that models attend to compositional and stylistic cues. Our findings demonstrate that multi-modal deep learning delivers significant value precisely when valuation is hardest, namely first-time sales, and thus offers new insights for both academic research and practice in art market valuation.
    Date: 2025–12
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2512.23078
  16. By: Linuk Perera
    Abstract: This research introduces a novel quantitative methodology tailored for quantitative finance applications, enabling banks, stockbrokers, and investors to predict economic regimes and market signals in emerging markets, specifically Sri Lankan stock indices (S&P SL20 and ASPI) by integrating Environmental, Social, and Governance (ESG) sentiment analysis with macroeconomic indicators and advanced time-series forecasting. Designed to leverage quantitative techniques for enhanced risk assessment, portfolio optimization, and trading strategies in volatile environments, the architecture employs FinBERT, a transformer-based NLP model, to extract sentiment from ESG texts, followed by unsupervised clustering (UMAP/HDBSCAN) to identify 5 latent ESG regimes, validated via PCA. These regimes are mapped to economic conditions using a dense neural network and gradient boosting classifier, achieving 84.04% training and 82.0% validation accuracy. Concurrently, time-series models (SRNN, MLP, LSTM, GRU) forecast daily closing prices, with GRU attaining an R-squared of 0.801 and LSTM delivering 52.78% directional accuracy on intraday data. A strong correlation between S&P SL20 and S&P 500, observed through moving average and volatility trend plots, further bolsters forecasting precision. A rule-based fusion logic merges ESG and time-series outputs for final market signals. By addressing literature gaps that overlook emerging markets and holistic integration, this quant-driven framework combines global correlations and local sentiment analysis to offer scalable, accurate tools for quantitative finance professionals navigating complex markets like Sri Lanka.
    Date: 2025–12
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2512.20216

This nep-big issue is ©2026 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at https://nep.repec.org. For comments please write to the director of NEP, Marco Novarese at <director@nep.repec.org>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.