nep-big New Economics Papers
on Big Data
Issue of 2025–10–13
nineteen papers chosen by
Tom Coupé, University of Canterbury


  1. Can Machine Learning Algorithms Outperform Traditional Models for Option Pricing? By Georgy Milyushkov
  2. Functional effects models: Accounting for preference heterogeneity in panel data with machine learning By Nicolas Salvadé; Tim Hillel
  3. Mamba Outpaces Reformer in Stock Prediction with Sentiments from Top Ten LLMs By Lokesh Antony Kadiyala; Amir Mirzaeinia
  4. What Can Satellite Imagery and Machine Learning Measure? By Jonathan Proctor; Tamma Carleton; Trinetta Chong; Taryn Fransen; Simon Greenhill; Jessica Katz; Hikari Murayama; Luke Sherman; Jeanette Tseng; Hannah Druckenmiller; Solomon Hsiang
  5. Predicting Credit Spreads and Ratings with Machine Learning: The Role of Non-Financial Data By Yanran Wu; Xinlei Zhang; Quanyi Xu; Qianxin Yang; Chao Zhang
  6. Parsing the pulse: decomposing macroeconomic sentiment with LLMs By Byeungchun Kwon; Taejin Park; Phurichai Rungcharoenkitkul; Frank Smets
  7. Multi-Agent Analysis of Off-Exchange Public Information for Cryptocurrency Market Trend Prediction By Kairan Hong; Jinling Gan; Qiushi Tian; Yanglinxuan Guo; Rui Guo; Runnan Li
  8. Minimizing the Value-at-Risk of Loan Portfolio via Deep Neural Networks By Albert Di Wang; Ye Du
  9. Private and public school efficiency gaps in Latin America-A combined DEA and machine learning approach based on PISA 2022 By Marcos Delprato
  10. How human is the machine? Evidence from 66,000 Conversations with Large Language Models By Antonios Stamatogiannakis; Arsham Ghodsinia; Sepehr Etminanrad; Dilney Gonçalves; David Santos
  11. LEMs: A Primer On Large Execution Models By Remi Genet; Hugo Inzirillo
  12. Inducing State Anxiety in LLM Agents Reproduces Human-Like Biases in Consumer Decision-Making By Ziv Ben-Zion; Zohar Elyoseph; Tobias Spiller; Teddy Lazebnik
  13. Understanding sentiments and discourse surrounding the Make America Healthy Again (#MAHA) movement on social media with pre-trained language models By Alba, Charles
  14. Does FOMC Tone Really Matter? Statistical Evidence from Spectral Graph Network Analysis By Jaeho Choi; Jaewon Kim; Seyoung Chung; Chae-shick Chung; Yoonsoo Lee
  15. FINCH: Financial Intelligence using Natural language for Contextualized SQL Handling By Avinash Kumar Singh; Bhaskarjit Sarmah; Stefano Pasquali
  16. An Adaptive Multi Agent Bitcoin Trading System By Aadi Singhi
  17. Financial Stability Implications of Generative AI: Taming the Animal Spirits By Anne Lundgaard Hansen; Seung Jung Lee
  18. Increase Alpha: Performance and Risk of an AI-Driven Trading Framework By Sid Ghatak; Arman Khaledian; Navid Parvini; Nariman Khaledian
  19. Interpretable Kernels By Patrick J.F. Groenen; Michael Greenacre

  1. By: Georgy Milyushkov
    Abstract: This study investigates the application of machine learning techniques, specifically neural networks, random forests, and CatBoost, to option pricing, in comparison with traditional models such as the Black-Scholes and Heston models. Using both synthetically generated data and real market option data, each model is evaluated on its ability to predict option prices. The results show that machine learning models can capture complex, non-linear relationships in option prices and, in several cases, outperform both the Black-Scholes and Heston models. These findings highlight the potential of data-driven methods to improve pricing accuracy and better reflect market dynamics.
    Date: 2025–10
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2510.01446
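For reference, the Black-Scholes benchmark against which the paper's ML models are compared has a closed form for a European call. A minimal Python sketch follows; the parameter values are illustrative, not taken from the paper:

```python
from math import log, sqrt, exp, erf

def norm_cdf(x):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def black_scholes_call(S, K, T, r, sigma):
    """Black-Scholes price of a European call.
    S: spot, K: strike, T: years to expiry, r: risk-free rate, sigma: volatility."""
    d1 = (log(S / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    return S * norm_cdf(d1) - K * exp(-r * T) * norm_cdf(d2)

# Hypothetical at-the-money call: 1 year to expiry, 20% vol, 5% rate.
price = black_scholes_call(100, 100, 1.0, 0.05, 0.20)
```

An ML pricer of the kind the paper studies would be trained to reproduce (and improve on) prices like this from market data.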
  2. By: Nicolas Salvadé; Tim Hillel
    Abstract: In this paper, we present a general specification for Functional Effects Models, which use Machine Learning (ML) methodologies to learn individual-specific preference parameters from socio-demographic characteristics, thereby accounting for inter-individual heterogeneity in panel choice data. We identify three specific advantages of the Functional Effects Model over traditional fixed and random/mixed effects models: (i) by mapping individual-specific effects as a function of socio-demographic variables, we can account for these effects when forecasting the choices of previously unobserved individuals; (ii) the (approximate) maximum-likelihood estimation of functional effects avoids the incidental parameters problem of the fixed effects model, even when the number of observed choices per individual is small; and (iii) we do not rely on the strong distributional assumptions of the random effects model, which may not match reality. We learn functional intercepts and functional slopes with powerful non-linear machine learning regressors for tabular data, namely gradient-boosted decision trees and deep neural networks. We validate the proposed methodology on a synthetic experiment and three real-world panel case studies, demonstrating that the Functional Effects Model (i) can identify the true values of individual-specific effects when the data generation process is known, and (ii) outperforms state-of-the-art ML choice modelling techniques that omit individual heterogeneity in terms of predictive performance, and traditional static panel choice models in terms of learning inter-individual heterogeneity. The results indicate that the FI-RUMBoost model, which combines the individual-specific constants of the Functional Effects Model with the complex, non-linear utilities of RUMBoost, performs marginally best on large-scale revealed-preference panel data.
    Date: 2025–09
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2509.18047
  3. By: Lokesh Antony Kadiyala; Amir Mirzaeinia
    Abstract: The stock market is extremely difficult to predict in the short term due to high market volatility, news-driven shifts, and the non-linear nature of financial time series. This research proposes a novel framework for improving minute-level prediction accuracy using semantic sentiment scores from ten different large language models (LLMs) combined with minute-interval intraday stock price data. We systematically constructed a time-aligned dataset of Apple Inc. (AAPL) news articles and 1-minute AAPL stock prices for April 4 to May 2, 2025. Sentiment analysis was performed using the DeepSeek-V3, GPT variants, LLaMA, Claude, Gemini, Qwen, and Mistral models through their APIs. Each article received sentiment scores from all ten LLMs, which were scaled to a [0, 1] range and combined with prices and technical indicators such as RSI, ROC, and Bollinger Band Width. Two state-of-the-art sequence models, Reformer and Mamba, were trained separately on the dataset using the sentiment scores produced by each LLM as input. Hyperparameters were optimized with Optuna, and the models were evaluated over a 3-day evaluation period using mean squared error (MSE); Mamba was not only faster but also more accurate than Reformer for every one of the ten LLMs tested. Mamba performed best with LLaMA 3.3--70B, with the lowest error of 0.137. While Reformer could capture broader trends within the data, it appeared to over-smooth the sudden changes signalled by the LLM sentiment scores. This study highlights the potential of pairing LLM-based semantic analysis with efficient temporal modeling to enhance real-time financial forecasting.
    Date: 2025–09
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2510.01203
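The preprocessing step of mapping heterogeneous LLM sentiment scores onto a common [0, 1] range can be done with simple min-max scaling. A minimal sketch; the input range and the raw scores are assumptions for illustration, not details of the paper's actual pipeline:

```python
def minmax_scale(scores, lo=-1.0, hi=1.0):
    """Map raw sentiment scores from [lo, hi] onto [0, 1], clipping outliers."""
    return [min(1.0, max(0.0, (s - lo) / (hi - lo))) for s in scores]

# Hypothetical raw scores for one article from three of the LLMs.
raw = [-0.4, 0.8, 0.1]
scaled = minmax_scale(raw)
```

The scaled scores can then be concatenated with price features (RSI, ROC, band width) as model input.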
  4. By: Jonathan Proctor; Tamma Carleton; Trinetta Chong; Taryn Fransen; Simon Greenhill; Jessica Katz; Hikari Murayama; Luke Sherman; Jeanette Tseng; Hannah Druckenmiller; Solomon Hsiang
    Abstract: Satellite imagery and machine learning (SIML) are increasingly being combined to remotely measure social and environmental outcomes, yet use of this technology has been limited by insufficient understanding of its strengths and weaknesses. Here, we undertake the most extensive effort yet to characterize the potential and limits of using a SIML technology to measure ground conditions. We conduct 115 standardized large-scale experiments using a composite high-resolution optical image of Earth and a generalizable SIML technology to evaluate what can be accurately measured and where this technology struggles. We find that SIML alone predicts roughly half the variation in ground measurements on average, and that variables describing human society (e.g. female literacy, R²=0.55) are generally as easily measured as natural variables (e.g. bird diversity, R²=0.55). Patterns of performance across measured variable type, space, income and population density indicate that SIML can likely support many new applications and decision-making use cases, although within quantifiable limits.
    JEL: C80 Q5
    Date: 2025–10
    URL: https://d.repec.org/n?u=RePEc:nbr:nberwo:34315
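The headline figures above are coefficients of determination (R²) between SIML predictions and ground measurements. For readers who want the metric spelled out, a minimal sketch:

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: the share of variance in y_true
    explained by y_pred (1.0 is perfect; 0.0 is no better than the mean)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

perfect = r_squared([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0])
mean_only = r_squared([1.0, 2.0, 3.0, 4.0], [2.5, 2.5, 2.5, 2.5])
```

On this scale, the paper's R²=0.55 means SIML explains a bit over half the variation in the ground measurements.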
  5. By: Yanran Wu; Xinlei Zhang; Quanyi Xu; Qianxin Yang; Chao Zhang
    Abstract: We build a 167-indicator comprehensive credit risk indicator set, integrating macro, corporate financial, and bond-specific indicators and, for the first time, 30 large-scale corporate non-financial indicators. We use seven machine learning models to construct a bond credit spread prediction model, test their spread predictive power and economic mechanisms, and verify their credit rating prediction effectiveness. Results show these models outperform Chinese credit rating agencies in explaining credit spreads. Specifically, adding non-financial indicators more than doubles out-of-sample performance relative to traditional feature-driven models. Mechanism analysis finds non-financial indicators far more important than traditional ones (macro-level, financial, and bond features): seven of the top 10 are non-financial (e.g., corporate governance, nature of property rights, information disclosure evaluation) and are the most stable predictors. The models identify high-risk traits (deteriorating operations, short-term debt, higher financing constraints) via these indicators for spread prediction and risk identification. Finally, we pioneer a credit rating model using predicted spreads (a predicted implied rating model), with full- and sub-industry models achieving over 75% accuracy, recall, and F1 scores. This paper provides valuable guidance for bond default early warning, credit rating, and financial stability.
    Date: 2025–09
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2509.19042
  6. By: Byeungchun Kwon; Taejin Park; Phurichai Rungcharoenkitkul; Frank Smets
    Abstract: Macroeconomic indicators provide quantitative signals that must be pieced together and interpreted by economists. We propose a reversed approach: parsing press narratives directly using Large Language Models (LLMs) to recover growth and inflation sentiment indices. A key advantage of this LLM-based approach is the ability to decompose aggregate sentiment into its drivers, readily enabling an interpretation of macroeconomic dynamics. Our sentiment indices track their hard-data counterparts closely, providing an accurate, near real-time picture of the macroeconomy. Their components (demand, supply, and deeper structural forces) are intuitive and consistent with prior model-based studies. Incorporating the sentiment indices improves the forecasting performance of simple statistical models, pointing to information unspanned by traditional data.
    Keywords: macroeconomic sentiment, growth, inflation, monetary policy, fiscal policy, LLMs, machine learning
    JEL: E30 E44 E60 C55 C82
    Date: 2025–10
    URL: https://d.repec.org/n?u=RePEc:bis:biswps:1294
  7. By: Kairan Hong; Jinling Gan; Qiushi Tian; Yanglinxuan Guo; Rui Guo; Runnan Li
    Abstract: Cryptocurrency markets present unique prediction challenges due to their extreme volatility, 24/7 operation, and hypersensitivity to news events; existing approaches struggle with key-information extraction and with detecting the sideways markets that are critical for risk management. We introduce a theoretically grounded multi-agent cryptocurrency trend prediction framework that advances the state of the art through three key innovations: (1) an information-preserving news analysis system with formal theoretical guarantees that systematically quantifies market impact, regulatory implications, volume dynamics, risk assessment, technical correlation, and temporal effects using large language models; (2) an adaptive volatility-conditional fusion mechanism with proven optimality properties that dynamically combines news sentiment and technical indicators based on market regime detection; and (3) a distributed multi-agent coordination architecture with low communication complexity that enables real-time processing of heterogeneous data streams. Comprehensive experimental evaluation on Bitcoin across three prediction horizons demonstrates statistically significant improvements over a state-of-the-art natural language processing baseline, establishing a new paradigm for financial machine learning with broad implications for quantitative trading and risk management systems.
    Date: 2025–10
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2510.08268
  8. By: Albert Di Wang; Ye Du
    Abstract: Risk management is a prominent issue in peer-to-peer lending. An investor may naturally reduce his risk exposure by diversifying instead of putting all his money into one loan. In that case, the investor may want to minimize the Value-at-Risk (VaR) or Conditional Value-at-Risk (CVaR) of his loan portfolio. We propose a low-degree-of-freedom deep neural network model, DeNN, as well as a high-degree-of-freedom model, DSNN, to tackle the problem. In particular, our models predict not only the default probability of a loan but also the time when it will default. The experiments demonstrate that both models can significantly reduce portfolio VaR at different confidence levels, compared to benchmarks. More interestingly, the low-degree-of-freedom model, DeNN, outperforms DSNN in most scenarios.
    Date: 2025–10
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2510.07444
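The quantities the paper minimizes, VaR and CVaR, can be estimated from a historical P&L sample. A minimal numpy sketch of the standard historical estimator (not the paper's DeNN/DSNN models), on simulated toy P&L:

```python
import numpy as np

def var_cvar(pnl, alpha=0.95):
    """Historical VaR and CVaR of a P&L sample.
    pnl holds profit (+) / loss (-) outcomes; returned values are losses (+)."""
    losses = -np.asarray(pnl, dtype=float)
    var = np.quantile(losses, alpha)        # loss exceeded with probability 1 - alpha
    cvar = losses[losses >= var].mean()     # mean loss in the tail beyond VaR
    return var, cvar

rng = np.random.default_rng(0)
pnl = rng.normal(0.0, 1.0, 100_000)         # toy portfolio P&L sample
var95, cvar95 = var_cvar(pnl, 0.95)
```

For standard normal P&L the 95% VaR is about 1.645 and the 95% CVaR about 2.06, which the sample estimates should approach.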
  9. By: Marcos Delprato
    Abstract: Latin America's education systems are fragmented and segregated, with substantial differences by school type. The concept of school efficiency (the ability of a school to produce the maximum level of outputs given available resources) is policy relevant due to the scarcity of resources in the region. Knowing whether private and public schools are making efficient use of resources -- and which are the leading drivers of efficiency -- is critical, even more so after the learning crisis brought on by the COVID-19 pandemic. In this paper, relying on data from 2,034 schools in nine Latin American countries from PISA 2022, I offer new evidence on school efficiency (on both cognitive and non-cognitive dimensions) using Data Envelopment Analysis (DEA) by school type, and then estimate the leading determinants of efficiency through interpretable machine learning (IML) methods. This hybrid DEA-IML approach accommodates the big-data issue of jointly assessing several determinants of school efficiency. I find a cognitive efficiency gap of nearly 0.10 favouring private schools and of 0.045 for non-cognitive outcomes, with lower heterogeneity in private than in public schools. For cognitive efficiency, the leading determinants of a private school's chance of being highly efficient are a larger stock of books and PCs at home, lack of engagement in paid work, and high school autonomy; low-efficiency public schools are shaped by poor school climate, high rates of repetition, truancy and intensive paid work, few books at home, and increased barriers to homework during the pandemic.
    Date: 2025–09
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2509.25353
  10. By: Antonios Stamatogiannakis; Arsham Ghodsinia; Sepehr Etminanrad; Dilney Gonçalves; David Santos
    Abstract: When Artificial Intelligence (AI) is used to replace consumers (e.g., synthetic data), it is often assumed that AI emulates established consumer, and more generally human, behaviors. Ten experiments with Large Language Models (LLMs) investigate whether this is true in the domain of well-documented biases and heuristics. Across studies we observe four distinct types of deviations from human-like behavior. First, in some cases, LLMs reduce or correct biases observed in humans. Second, in other cases, LLMs amplify these same biases. Third, and perhaps most intriguingly, LLMs sometimes exhibit biases opposite to those found in humans. Fourth, LLMs' responses to the same (or similar) prompts tend to be inconsistent (a) within the same model after a time delay, (b) across models, and (c) among independent research studies. Such inconsistencies can be uncharacteristic of humans and suggest that, at least at one point, LLMs' responses differed from humans'. Overall, such non-human-like responses are problematic when LLMs are used to mimic or predict consumer behavior. These findings complement research on synthetic consumer data by showing that sources of bias are not necessarily human-centric. They also contribute to the debate about the tasks for which consumers, and more generally humans, can be replaced by AI.
    Date: 2025–08
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2510.07321
  11. By: Remi Genet; Hugo Inzirillo
    Abstract: This paper introduces Large Execution Models (LEMs), a novel deep learning framework that extends transformer-based architectures to address complex execution problems with flexible time boundaries and multiple execution constraints. Building upon recent advances in neural VWAP execution strategies, LEMs generalize the approach from fixed-duration orders to scenarios where execution duration is bounded between minimum and maximum time horizons, similar to share buyback contract structures. The proposed architecture decouples market information processing from execution allocation decisions: a common feature extraction pipeline using Temporal Kolmogorov-Arnold Networks (TKANs), Variable Selection Networks (VSNs), and multi-head attention mechanisms processes market data to create informational context, while independent allocation networks handle the specific execution logic for different scenarios (fixed quantity vs. fixed notional, buy vs. sell orders). This architectural separation enables a unified model to handle diverse execution objectives while leveraging shared market understanding across scenarios. Through comprehensive empirical evaluation on intraday cryptocurrency markets and multi-day equity trading using Dow Jones constituents, we demonstrate that LEMs achieve superior execution performance compared to traditional benchmarks by dynamically optimizing execution paths within flexible time constraints. The unified model architecture enables deployment across different execution scenarios (buy/sell orders, varying duration boundaries, volume/notional targets) through a single framework, providing significant operational advantages over asset-specific approaches.
    Date: 2025–09
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2509.25211
  12. By: Ziv Ben-Zion; Zohar Elyoseph; Tobias Spiller; Teddy Lazebnik
    Abstract: Large language models (LLMs) are rapidly evolving from text generators to autonomous agents, raising urgent questions about their reliability in real-world contexts. Stress and anxiety are well known to bias human decision-making, particularly in consumer choices. Here, we tested whether LLM agents exhibit analogous vulnerabilities. Three advanced models (ChatGPT-5, Gemini 2.5, Claude 3.5-Sonnet) performed a grocery shopping task under budget constraints (24, 54, 108 USD), before and after exposure to anxiety-inducing traumatic narratives. Across 2,250 runs, traumatic prompts consistently reduced the nutritional quality of shopping baskets (Change in Basket Health Scores of -0.081 to -0.126; all pFDR
    Date: 2025–08
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2510.06222
  13. By: Alba, Charles
    Abstract: The Make America Healthy Again (MAHA) movement highlights the complexity of public perceptions surrounding policies rooted in populist health initiatives in the United States. While supporters embrace the movement’s messaging around healthier lifestyles and nutritional choices, critics point to its political undertones and the widespread misinformation associated with it. To better understand the dynamics of public perceptions of the MAHA movement, our study leverages sentiment classification and topic modeling to analyze discourse about the movement on X. Our analysis reveals an interesting trend: although the movement initially generated overwhelmingly positive sentiment, overall positivity declined following the nomination and subsequent confirmation of Robert F. Kennedy Jr. as President Trump’s Secretary of Health and Human Services, to the point where negative sentiment began to outweigh positive sentiment. Further integrating these trends with topic modeling suggests that while the public initially supported the ideals of healthier lifestyles promoted by the MAHA movement, enthusiasm diminished and eventually turned negative once the movement became entangled with political dimensions and controversial policies.
    Date: 2025–10–02
    URL: https://d.repec.org/n?u=RePEc:osf:socarx:2gsxq_v1
  14. By: Jaeho Choi; Jaewon Kim; Seyoung Chung; Chae-shick Chung; Yoonsoo Lee
    Abstract: This study examines the relationship between Federal Open Market Committee (FOMC) announcements and financial market network structure through spectral graph theory. Using hypergraph networks constructed from S&P 100 stocks around FOMC announcement dates (2011--2024), we employ the Fiedler value -- the second eigenvalue of the hypergraph Laplacian -- to measure changes in market connectivity and systemic stability. Our event study methodology reveals that FOMC announcements significantly alter network structure across multiple time horizons. Analysis of policy tone, classified using natural language processing, reveals heterogeneous effects: hawkish announcements induce network fragmentation at short horizons (k = 6) followed by reconsolidation at medium horizons (k = 14), while neutral statements show limited immediate impact but exhibit delayed fragmentation. These findings suggest that monetary policy communication affects market architecture through network-structural transmission, with effects varying by announcement timing and policy stance.
    Date: 2025–10
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2510.02705
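The Fiedler value the paper tracks is the second-smallest eigenvalue of a Laplacian matrix. A minimal numpy sketch for an ordinary graph Laplacian L = D - A (the paper uses a hypergraph variant, which this does not reproduce): the value is 0 for a disconnected graph and grows with connectivity.

```python
import numpy as np

def fiedler_value(adj):
    """Second-smallest eigenvalue of the graph Laplacian L = D - A."""
    adj = np.asarray(adj, dtype=float)
    lap = np.diag(adj.sum(axis=1)) - adj    # degree matrix minus adjacency
    return np.sort(np.linalg.eigvalsh(lap))[1]

# Path graph on 3 nodes (0-1-2): Laplacian eigenvalues are 0, 1, 3.
path = [[0, 1, 0],
        [1, 0, 1],
        [0, 1, 0]]
lam2 = fiedler_value(path)
```

In the paper's setting, a drop in this value around an announcement signals market fragmentation; a rise signals reconsolidation.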
  15. By: Avinash Kumar Singh; Bhaskarjit Sarmah; Stefano Pasquali
    Abstract: Text-to-SQL, the task of translating natural language questions into SQL queries, has long been a central challenge in NLP. While progress has been significant, applying it to the financial domain remains especially difficult due to complex schemas, domain-specific terminology, and the high stakes of error. Despite this, there is no dedicated large-scale financial dataset to advance research, creating a critical gap. To address this, we introduce a curated financial dataset (FINCH) comprising 292 tables and 75,725 natural language-SQL pairs, enabling both fine-tuning and rigorous evaluation. Building on this resource, we benchmark reasoning models and language models of varying scales, providing a systematic analysis of their strengths and limitations in financial Text-to-SQL tasks. Finally, we propose a finance-oriented evaluation metric (FINCH Score) that captures nuances overlooked by existing measures, offering a more faithful assessment of model performance.
    Date: 2025–10
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2510.01887
  16. By: Aadi Singhi
    Abstract: This paper presents a Multi Agent Bitcoin Trading system that utilizes Large Language Models (LLMs) for alpha generation and portfolio management in the cryptocurrency market. Unlike equities, cryptocurrencies exhibit extreme volatility and are heavily influenced by rapidly shifting market sentiment and regulatory announcements, making them difficult to model using static regression models or neural networks trained solely on historical data [53]. The proposed framework overcomes this by structuring LLMs into specialised agents for technical analysis, sentiment evaluation, decision-making, and performance reflection. The system improves over time through a novel verbal feedback mechanism in which a Reflect agent provides daily and weekly natural-language critiques of trading decisions. These textual evaluations are then injected into future prompts, allowing the system to adjust indicator priorities, sentiment weights, and allocation logic without parameter updates or fine-tuning. Back-testing on Bitcoin price data from July 2024 to April 2025 shows consistent outperformance across market regimes: the Quantitative agent delivered over 30% higher returns in bullish phases and 15% overall gains versus buy-and-hold, while the sentiment-driven agent turned sideways markets from a small loss into a gain of over 100%. Adding weekly feedback further improved total performance by 31% and reduced bearish losses by 10%. The results demonstrate that verbal feedback represents a new, scalable, and low-cost method of tuning LLMs for financial goals.
    Date: 2025–10
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2510.08068
  17. By: Anne Lundgaard Hansen; Seung Jung Lee
    Abstract: This paper investigates the impact of the adoption of generative AI on financial stability. We conduct laboratory-style experiments using large language models to replicate classic studies on herd behavior in trading decisions. Our results show that AI agents make more rational decisions than humans, relying predominantly on private information over market trends. Increased reliance on AI-powered trading advice could therefore potentially lead to fewer asset price bubbles arising from animal spirits that trade by following the herd. However, exploring variations in the experimental settings reveals that AI agents can be induced to herd optimally when explicitly guided to make profit-maximizing decisions. While optimal herding improves market discipline, this behavior still carries potential implications for financial stability. In other experimental variations, we show that AI agents are not purely algorithmic, but have inherited some elements of human conditioning and bias.
    Date: 2025–10
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2510.01451
  18. By: Sid Ghatak; Arman Khaledian; Navid Parvini; Nariman Khaledian
    Abstract: There are inefficiencies in financial markets, with unexploited patterns in price, volume, and cross-sectional relationships. While many approaches use large-scale transformers, we take a domain-focused path: feed-forward and recurrent networks with curated features to capture subtle regularities in noisy financial data. This smaller-footprint design is computationally lean and reliable under low signal-to-noise conditions, crucial for daily production at scale. At Increase Alpha, we built a deep-learning framework that maps over 800 U.S. equities into daily directional signals with minimal computational overhead. The purpose of this paper is twofold. First, we outline a general overview of the predictive model without disclosing its core underlying concepts. Second, we evaluate its real-time performance through transparent, industry-standard metrics. Forecast accuracy is benchmarked against both naive baselines and macro indicators. Performance outcomes are summarized via cumulative returns, annualized Sharpe ratio, and maximum drawdown. The best portfolio combination using our signals provides a low-risk, continuous stream of returns with a Sharpe ratio of more than 2.5, a maximum drawdown of around 3%, and a near-zero correlation with the S&P 500 market benchmark. We also examine the model's performance across different market regimes, such as the volatile movements of the US equity market at the beginning of 2025. Our analysis showcases the robustness of the model and its notably stable performance during these volatile periods. Collectively, these findings show that market inefficiencies can be systematically harvested with modest computational overhead if the right variables are considered. This report emphasizes the potential of traditional deep learning frameworks for generating an AI-driven edge in financial markets.
    Date: 2025–09
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2509.16707
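The reported metrics, annualized Sharpe ratio and maximum drawdown, are computed from a return series in a standard way. A minimal numpy sketch with made-up daily returns, not the firm's actual signals:

```python
import numpy as np

def sharpe_annualized(daily_returns, periods=252):
    """Annualized Sharpe ratio of a daily return series (risk-free rate assumed 0)."""
    r = np.asarray(daily_returns, dtype=float)
    return np.sqrt(periods) * r.mean() / r.std(ddof=1)

def max_drawdown(daily_returns):
    """Largest peak-to-trough drop of the cumulative equity curve, as a fraction."""
    equity = np.cumprod(1.0 + np.asarray(daily_returns, dtype=float))
    peaks = np.maximum.accumulate(equity)
    return float(np.max(1.0 - equity / peaks))

rets = [0.01, -0.02, 0.005, 0.01, -0.01]   # hypothetical daily returns
mdd = max_drawdown(rets)
s = sharpe_annualized(rets)
```

Here the single -2% day sets the maximum drawdown, since the equity curve never falls further below its running peak.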
  19. By: Patrick J.F. Groenen; Michael Greenacre
    Abstract: The use of kernels for nonlinear prediction is widespread in machine learning. They have been popularized in support vector machines and used in kernel ridge regression, amongst others. Kernel methods share three aspects. First, instead of the original matrix of predictor variables or features, each observation is mapped into an enlarged feature space. Second, a ridge penalty term is used to shrink the coefficients on the features in the enlarged feature space. Third, the solution is not obtained in this enlarged feature space, but through solving a dual problem in the observation space. A major drawback in the present use of kernels is that the interpretation in terms of the original features is lost. In this paper, we argue that in the case of a wide matrix of features, where there are more features than observations, the kernel solution can be re-expressed in terms of a linear combination of the original matrix of features and a ridge penalty that involves a special metric. Consequently, the exact same predicted values can be obtained as a weighted linear combination of the features in the usual manner and thus can be interpreted. In the case where the number of features is less than the number of observations, we discuss a least-squares approximation of the kernel matrix that still allows the interpretation in terms of a linear combination. It is shown that these results hold for any function of a linear combination that minimizes the coefficients and has a ridge penalty on these coefficients, such as in kernel logistic regression and kernel Poisson regression. This work makes a contribution to interpretable artificial intelligence.
    JEL: C19 C88
    Date: 2025–09
    URL: https://d.repec.org/n?u=RePEc:upf:upfgen:1915
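The paper's central point, that with more features than observations a kernel solution can be rewritten as a linear combination of the original features, is easiest to see with a linear kernel, where the dual ridge solution maps back to primal feature weights exactly. A minimal numpy sketch on random data (illustrative, not the paper's general construction):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 50                      # wide case: more features than observations
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
lam = 0.5

# Dual (kernel) ridge regression with a linear kernel K = X X^T.
K = X @ X.T
alpha = np.linalg.solve(K + lam * np.eye(n), y)
pred_dual = K @ alpha

# Equivalent primal weights on the original features: w = X^T alpha.
w = X.T @ alpha
pred_primal = X @ w
```

Because pred_dual = X X^T alpha = X w, the two prediction vectors coincide, and w gives the interpretable per-feature coefficients the paper is after.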

This nep-big issue is ©2025 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at https://nep.repec.org. For comments please write to the director of NEP, Marco Novarese at <director@nep.repec.org>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.