nep-ain New Economics Papers
on Artificial Intelligence
Issue of 2025–10–13
nineteen papers chosen by
Ben Greiner, Wirtschaftsuniversität Wien


  1. Can GenAI Improve Academic Performance? Evidence from the Social and Behavioral Sciences By Dragan Filimonovic; Christian Rutzer; Conny Wunsch
  2. The Impact of AI and Digital Platforms on the Information Ecosystem By Joseph E. Stiglitz; Maxim Ventura-Bolet
  3. Genius on Demand: The Value of Transformative Artificial Intelligence By Ajay K. Agrawal; Joshua S. Gans; Avi Goldfarb
  4. When Machines Meet Each Other: Network Effects and the Strategic Role of History in Multi-Agent AI By Yu Liu; Wenwen Li; Yifan Dou; Guangnan Ye
  5. Inducing State Anxiety in LLM Agents Reproduces Human-Like Biases in Consumer Decision-Making By Ziv Ben-Zion; Zohar Elyoseph; Tobias Spiller; Teddy Lazebnik
  6. How human is the machine? Evidence from 66,000 Conversations with Large Language Models By Antonios Stamatogiannakis; Arsham Ghodsinia; Sepehr Etminanrad; Dilney Gonçalves; David Santos
  7. Platform-Enabled Algorithmic Pricing By Shota Ichihashi
  8. AI and jobs. A review of theory, estimates, and evidence By R. Maria del Rio-Chanona; Ekkehard Ernst; Rossana Merola; Daniel Samaan; Ole Teutloff
  9. The AI Productivity Index (APEX) By Bertie Vidgen; Abby Fennelly; Evan Pinnix; Chirag Mahapatra; Zach Richards; Austin Bridges; Calix Huang; Ben Hunsberger; Fez Zafar; Brendan Foody; Dominic Barton; Cass R. Sunstein; Eric Topol; Osvald Nitski
  10. Making AI Count: The Next Measurement Frontier By Diane Coyle; John Lourenze S. Poquiz
  11. Artificial intelligence as a complement to other innovation activities and as a method of invention By Arenas Díaz, Guillermo; Piva, Mariacristina; Vivarelli, Marco
  12. Parsing the pulse: decomposing macroeconomic sentiment with LLMs By Byeungchun Kwon; Taejin Park; Phurichai Rungcharoenkitkul; Frank Smets
  13. Financial Stability Implications of Generative AI: Taming the Animal Spirits By Anne Lundgaard Hansen; Seung Jung Lee
  14. An Artificial Intelligence Value at Risk Approach: Metrics and Models By Luis Enriquez Alvarez
  15. FinReflectKG - EvalBench: Benchmarking Financial KG with Multi-Dimensional Evaluation By Fabrizio Dimino; Abhinav Arun; Bhaskarjit Sarmah; Stefano Pasquali
  16. Uncovering Representation Bias for Investment Decisions in Open-Source Large Language Models By Fabrizio Dimino; Krati Saxena; Bhaskarjit Sarmah; Stefano Pasquali
  17. The Effect of AI Investment Announcements on Adopting Companies' Abnormal Returns: A Critical Analysis of the UK Market By Kurter, Zeynep O.; Bhatti, Balaaj
  18. Mamba Outpaces Reformer in Stock Prediction with Sentiments from Top Ten LLMs By Lokesh Antony Kadiyala; Amir Mirzaeinia
  19. Can language models boost the power of randomized experiments without statistical bias? By Xinrui Ruan; Xinwei Ma; Yingfei Wang; Waverly Wei; Jingshen Wang

  1. By: Dragan Filimonovic; Christian Rutzer; Conny Wunsch
    Abstract: This paper estimates the effect of Generative AI (GenAI) adoption on scientific productivity and quality in the social and behavioral sciences. Using matched author-level panel data and a difference-in-differences design, we find that GenAI adoption is associated with sizable increases in research productivity, measured by the number of published papers. It also leads to moderate gains in publication quality, based on journal impact factors. These effects are most pronounced among early-career researchers, authors working in technically complex subfields, and those from non-English-speaking countries. The results suggest that GenAI tools may help lower some structural barriers in academic publishing and promote more inclusive participation in research.
    Date: 2025–10
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2510.02408
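    A minimal sketch of the two-way fixed-effects difference-in-differences regression this abstract describes, assuming a hypothetical matched author-year panel with columns author_id, year, n_papers, and a treated_post adoption indicator; the authors' matching procedure and exact specification are not given in the abstract.
      # Hedged sketch: TWFE difference-in-differences for GenAI adoption.
      # The panel schema and file name are hypothetical stand-ins.
      import pandas as pd
      import statsmodels.formula.api as smf

      panel = pd.read_csv("author_year_panel.csv")  # matched author-year panel

      # treated_post = 1 for adopters in post-adoption years, else 0
      fit = smf.ols(
          "n_papers ~ treated_post + C(author_id) + C(year)", data=panel
      ).fit(cov_type="cluster", cov_kwds={"groups": panel["author_id"]})
      print(fit.params["treated_post"])  # DiD estimate of the adoption effect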
  2. By: Joseph E. Stiglitz; Maxim Ventura-Bolet
    Abstract: We develop a tractable model to study how AI and digital platforms impact the information ecosystem. News producers — who create truthful or untruthful content that becomes a public good or bad — earn revenue from consumer visits. Consumers search for information and differ in their ability to distinguish truthful from untruthful information. AI and digital platforms influence the ecosystem by: improving the efficiency of processing and transmission of information, endangering the producer business model, changing the relative cost of producing misinformation, and altering the ability of consumers to screen quality. We find that in the absence of adequate regulation (accountability, content moderation, and intellectual property protection) the quality of the information ecosystem may decline, both because the equilibrium quantity of truthful information declines and the share of misinformation increases; and polarization may intensify. While some of these problems are already evident with digital platforms, AI may have different, and overall more adverse, impacts.
    JEL: D8 D83 O33
    Date: 2025–10
    URL: https://d.repec.org/n?u=RePEc:nbr:nberwo:34318
  3. By: Ajay K. Agrawal; Joshua S. Gans; Avi Goldfarb
    Abstract: This paper examines how the emergence of transformative AI systems providing "genius on demand" would affect knowledge worker allocation and labour market outcomes. We develop a simple model distinguishing between routine knowledge workers, who can only apply existing knowledge with some uncertainty, and genius workers, who create new knowledge at a cost increasing with distance from a known point. When genius capacity is scarce, we find it should be allocated primarily to questions at domain boundaries rather than at midpoints between known answers. The introduction of AI geniuses fundamentally transforms this allocation. In the short run, human geniuses specialise in questions that are furthest from existing knowledge, where their comparative advantage over AI is greatest. In the long run, routine workers may be completely displaced if AI efficiency approaches human genius efficiency.
    JEL: D24 J24 O33
    Date: 2025–10
    URL: https://d.repec.org/n?u=RePEc:nbr:nberwo:34316
  4. By: Yu Liu; Wenwen Li; Yifan Dou; Guangnan Ye
    Abstract: As artificial intelligence (AI) enters the agentic era, large language models (LLMs) are increasingly deployed as autonomous agents that interact with one another rather than operate in isolation. This shift raises a fundamental question: how do machine agents behave in interdependent environments where outcomes depend not only on their own choices but also on the coordinated expectations of peers? To address this question, we study LLM agents in a canonical network-effect game, where economic theory predicts convergence to a fulfilled expectation equilibrium (FEE). We design an experimental framework in which 50 heterogeneous GPT-5-based agents repeatedly interact under systematically varied network-effect strengths, price trajectories, and decision-history lengths. The results reveal that LLM agents systematically diverge from FEE: they underestimate participation at low prices, overestimate at high prices, and sustain persistent dispersion. Crucially, the way history is structured emerges as a design lever. Simple monotonic histories, where past outcomes follow a steady upward or downward trend, help stabilize coordination, whereas nonmonotonic histories amplify divergence and path dependence. Regression analyses at the individual level further show that price is the dominant driver of deviation, history moderates this effect, and network effects amplify contextual distortions. Together, these findings advance machine behavior research by providing the first systematic evidence on multi-agent AI systems under network effects and offer guidance for configuring such systems in practice.
    Date: 2025–10
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2510.06903
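    The fulfilled expectation equilibrium benchmark the agents are scored against can be illustrated in a stylized network-effect market (not necessarily the paper's exact game): with types v ~ U[0, 1], price p, and network-effect strength g, a consumer joins iff v(1 + g·x) >= p given expected participation x, and an FEE is a fixed point where expected and realized participation coincide.
      # Hedged sketch: FEE in a stylized network-effect game; the functional
      # form and parameters are illustrative, not the paper's specification.
      def realized(x_expected: float, p: float, g: float) -> float:
          cutoff = p / (1.0 + g * x_expected)   # marginal type who joins
          return min(max(1.0 - cutoff, 0.0), 1.0)

      def fee(p: float, g: float, x0: float = 0.5, tol: float = 1e-10) -> float:
          x = x0
          for _ in range(10_000):               # fixed-point iteration
              x_new = realized(x, p, g)
              if abs(x_new - x) < tol:
                  break
              x = x_new
          return x

      print(fee(p=0.6, g=0.8))  # rational participation the agents should match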
  5. By: Ziv Ben-Zion; Zohar Elyoseph; Tobias Spiller; Teddy Lazebnik
    Abstract: Large language models (LLMs) are rapidly evolving from text generators to autonomous agents, raising urgent questions about their reliability in real-world contexts. Stress and anxiety are well known to bias human decision-making, particularly in consumer choices. Here, we tested whether LLM agents exhibit analogous vulnerabilities. Three advanced models (ChatGPT-5, Gemini 2.5, Claude 3.5-Sonnet) performed a grocery shopping task under budget constraints (24, 54, 108 USD), before and after exposure to anxiety-inducing traumatic narratives. Across 2,250 runs, traumatic prompts consistently reduced the nutritional quality of shopping baskets (Change in Basket Health Scores of -0.081 to -0.126; all pFDR
    Date: 2025–08
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2510.06222
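    The reported pFDR values are false-discovery-rate-corrected p-values; a paired pre/post comparison of basket health scores with Benjamini-Hochberg correction might look as follows, with all scores simulated rather than taken from the paper.
      # Hedged sketch: paired comparison of basket health scores before and
      # after the anxiety-inducing prompt, with BH false-discovery control.
      import numpy as np
      from scipy import stats
      from statsmodels.stats.multitest import multipletests

      rng = np.random.default_rng(0)
      pvals = []
      for label in ["model_a", "model_b", "model_c"]:   # placeholder models
          baseline = rng.normal(0.70, 0.05, 250)        # simulated scores
          post = baseline - 0.10 + rng.normal(0, 0.05, 250)
          _, p = stats.ttest_rel(post, baseline)
          pvals.append(p)

      reject, p_fdr, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
      print(p_fdr, reject)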
  6. By: Antonios Stamatogiannakis; Arsham Ghodsinia; Sepehr Etminanrad; Dilney Gon\c{c}alves; David Santos
    Abstract: When Artificial Intelligence (AI) is used to replace consumers (e.g., synthetic data), it is often assumed that AI emulates established consumer and, more generally, human behaviors. Ten experiments with Large Language Models (LLMs) investigate if this is true in the domain of well-documented biases and heuristics. Across studies we observe four distinct types of deviations from human-like behavior. First, in some cases, LLMs reduce or correct biases observed in humans. Second, in other cases, LLMs amplify these same biases. Third, and perhaps most intriguingly, LLMs sometimes exhibit biases opposite to those found in humans. Fourth, LLMs' responses to the same (or similar) prompts tend to be inconsistent (a) within the same model after a time delay, (b) across models, and (c) among independent research studies. Such inconsistencies can be uncharacteristic of humans and suggest that, at least at one point, LLMs' responses differed from those of humans. Overall, unhuman-like responses are problematic when LLMs are used to mimic or predict consumer behavior. These findings complement research on synthetic consumer data by showing that sources of bias are not necessarily human-centric. They also contribute to the debate about the tasks for which consumers, and more generally humans, can be replaced by AI.
    Date: 2025–08
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2510.07321
  7. By: Shota Ichihashi (Department of Economics, Queen's University, Kingston, ON, Canada)
    Abstract: I study a model of platform-enabled algorithmic pricing. Sellers offer identical products, to which consumers have heterogeneous values. Sellers can post a uniform price outside the platform or join the platform and delegate their pricing decision to the platform's algorithm. I show that the platform can offer a pricing algorithm to attract sellers, stifle off-platform competition, and earn a positive profit. Prohibiting the platform from using consumer data for its algorithm increases consumer surplus but decreases total surplus. A transparency requirement, which mandates the platform to share its data and algorithms with sellers, restores the first-best outcome for consumers.
    Keywords: price discrimination, algorithmic pricing, competition, collusion, algorithm
    JEL: D43
    Date: 2025–09
    URL: https://d.repec.org/n?u=RePEc:net:wpaper:2503
  8. By: R. Maria del Rio-Chanona; Ekkehard Ernst; Rossana Merola; Daniel Samaan; Ole Teutloff
    Abstract: Generative AI is altering work processes, task composition, and organizational design, yet its effects on employment and the macroeconomy remain unresolved. In this review, we synthesize theory and empirical evidence at three levels. First, we trace the evolution from aggregate production frameworks to task- and expertise-based models. Second, we quantitatively review and compare (ex-ante) AI exposure measures of occupations from multiple studies and find convergence towards high-wage jobs. Third, we assemble ex-post evidence of AI's impact on employment from randomized controlled trials (RCTs), field experiments, and digital trace data (e.g., online labor platforms, software repositories), complemented by partial coverage of surveys. Across the reviewed studies, productivity gains are sizable but context-dependent: on the order of 20 to 60 percent in controlled RCTs, and 15 to 30 percent in field experiments. Novice workers tend to benefit more from LLMs in simple tasks. Across complex tasks, evidence is mixed on whether low- or high-skilled workers benefit more. Digital trace data show substitution between humans and machines in writing and translation alongside rising demand for AI, with mild evidence of declining demand for novice workers. A more substantial decrease in demand for novice jobs across AI-complementary work emerges from recent studies using surveys, platform payment records, or administrative data. Research gaps include the focus on simple tasks in experiments, the limited diversity of LLMs studied, and technology-centric AI exposure measures that overlook adoption dynamics and whether exposure translates into substitution, productivity gains, or the erosion or deepening of expertise.
    Date: 2025–09
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2509.15265
  9. By: Bertie Vidgen; Abby Fennelly; Evan Pinnix; Chirag Mahapatra; Zach Richards; Austin Bridges; Calix Huang; Ben Hunsberger; Fez Zafar; Brendan Foody; Dominic Barton; Cass R. Sunstein; Eric Topol; Osvald Nitski
    Abstract: We introduce the first version of the AI Productivity Index (APEX), a benchmark for assessing whether frontier AI models can perform knowledge work with high economic value. APEX addresses one of the largest inefficiencies in AI research: outside of coding, benchmarks often fail to test economically relevant capabilities. APEX-v1.0 contains 200 test cases and covers four domains: investment banking, management consulting, law, and primary medical care. It was built in three steps. First, we sourced experts with top-tier experience, e.g., investment bankers from Goldman Sachs. Second, experts created prompts that reflect high-value tasks in their day-to-day work. Third, experts created rubrics for evaluating model responses. We evaluate 23 frontier models on APEX-v1.0 using an LM judge. GPT 5 (Thinking = High) achieves the highest mean score (64.2%), followed by Grok 4 (61.3%) and Gemini 2.5 Flash (Thinking = On) (60.4%). Qwen 3 235B is the best performing open-source model and seventh best overall. There is a large gap between the performance of even the best models and human experts, highlighting the need for better measurement of models' ability to produce economically valuable work.
    Date: 2025–09
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2509.25721
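    The rubric-plus-LM-judge pipeline can be sketched generically; the judge below is a trivial keyword stub standing in for a real judge LLM, and the rubric format is an assumption, since the abstract does not specify it.
      # Hedged sketch of a rubric-based LM-judge scoring loop.
      from dataclasses import dataclass

      @dataclass
      class TestCase:
          prompt: str
          rubric: list[str]       # expert-written criteria, one point each

      def judge(response: str, criterion: str) -> bool:
          # Placeholder heuristic; the real benchmark calls a judge LLM here.
          return criterion.lower() in response.lower()

      def score(case: TestCase, response: str) -> float:
          return sum(judge(response, c) for c in case.rubric) / len(case.rubric)

      case = TestCase("Draft a merger risk memo.",
                      ["antitrust review", "deal timeline"])
      print(score(case, "Flag the antitrust review and the deal timeline."))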
  10. By: Diane Coyle; John Lourenze S. Poquiz
    Abstract: Generative AI is transforming production, consumption, and work, yet current statistical frameworks would likely struggle to capture its full economic impact. While the 2025 System of National Accounts introduces AI as a distinct asset, challenges remain in valuing AI-related investments, inputs, and outputs. Moreover, as a general-purpose technology, AI alters business processes, service quality, and labor organization in ways poorly reflected in official data. This paper outlines key measurement gaps from transformative AI, including the tracking of cross-border inputs, quality change, and process changes. We argue that economic statistics should adopt more granular, task-based, and outcome-focused approaches to ensure relevance in an increasingly AI-driven economy.
    JEL: C80 E01 E22 O3 O47
    Date: 2025–10
    URL: https://d.repec.org/n?u=RePEc:nbr:nberwo:34330
  11. By: Arenas Díaz, Guillermo; Piva, Mariacristina; Vivarelli, Marco
    Abstract: This study investigates the relationship between Artificial Intelligence (AI) and innovation inputs in Spanish manufacturing firms. While AI is increasingly recognized as a driver of productivity and economic growth, its role in shaping firms’ innovation strategies remains underexplored. Using firm-level data, our analysis focuses on whether AI complements innovation inputs - specifically R&D and Embodied Technological Change (ETC) - and whether AI can be considered as a Method of Invention, able to trigger subsequent innovation investments. Results show a positive association between AI adoption and both internal R&D and ETC, in a static and a dynamic framework. Furthermore, empirical evidence also highlights heterogeneity, with important peculiarities affecting large vs small firms and high-tech vs low-tech companies. These findings suggest that AI may act as both a complement and a catalyst, depending on firm characteristics.
    JEL: O31 O32
    Date: 2025–10–03
    URL: https://d.repec.org/n?u=RePEc:unm:unumer:2025022
  12. By: Byeungchun Kwon; Taejin Park; Phurichai Rungcharoenkitkul; Frank Smets
    Abstract: Macroeconomic indicators provide quantitative signals that must be pieced together and interpreted by economists. We propose a reversed approach of parsing press narratives directly using Large Language Models (LLMs) to recover growth and inflation sentiment indices. A key advantage of this LLM-based approach is the ability to decompose aggregate sentiment into its drivers, readily enabling an interpretation of macroeconomic dynamics. Our sentiment indices track hard-data counterparts closely, providing an accurate, near real-time picture of the macroeconomy. Their components (demand, supply, and deeper structural forces) are intuitive and consistent with prior model-based studies. Incorporating sentiment indices improves the forecasting performance of simple statistical models, pointing to information unspanned by traditional data.
    Keywords: macroeconomic sentiment, growth, inflation, monetary policy, fiscal policy, LLMs, machine learning
    JEL: E30 E44 E60 C55 C82
    Date: 2025–10
    URL: https://d.repec.org/n?u=RePEc:bis:biswps:1294
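    A sketch of the aggregation step: article-level LLM sentiment labels are averaged into a monthly index and summed by driver to decompose it. The schema (score in {-1, 0, +1}, driver in {demand, supply, structural}) is an assumption about how such a pipeline could be organized, not the authors' implementation.
      # Hedged sketch: monthly sentiment index with driver decomposition.
      import pandas as pd

      articles = pd.DataFrame({
          "date": pd.to_datetime(["2025-01-03", "2025-01-15", "2025-02-07"]),
          "score": [1, -1, 1],                       # LLM-assigned sentiment
          "driver": ["demand", "supply", "demand"],  # LLM-assigned driver
      })

      # "ME" is the month-end frequency alias (pandas >= 2.2)
      monthly = articles.set_index("date").resample("ME")["score"].mean()
      by_driver = (articles.set_index("date")
                   .groupby("driver").resample("ME")["score"].sum())
      print(monthly)     # aggregate growth-sentiment index
      print(by_driver)   # driver contributions behind the aggregate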
  13. By: Anne Lundgaard Hansen; Seung Jung Lee
    Abstract: This paper investigates the impact of the adoption of generative AI on financial stability. We conduct laboratory-style experiments using large language models to replicate classic studies on herd behavior in trading decisions. Our results show that AI agents make more rational decisions than humans, relying predominantly on private information over market trends. Increased reliance on AI-powered trading advice could therefore potentially lead to fewer asset price bubbles arising from animal spirits that trade by following the herd. However, exploring variations in the experimental settings reveals that AI agents can be induced to herd optimally when explicitly guided to make profit-maximizing decisions. While optimal herding improves market discipline, this behavior still carries potential implications for financial stability. In other experimental variations, we show that AI agents are not purely algorithmic, but have inherited some elements of human conditioning and bias.
    Date: 2025–10
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2510.01451
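    Classic herding experiments of this kind score choices against the Bayesian benchmark that combines a private signal with the public history; a sketch of that benchmark, assuming two equally likely states and signals that are correct with probability q (the paper's exact design may differ).
      # Hedged sketch: Bayesian benchmark for an urn-style herding task.
      def posterior_a(signals, q=2/3):
          # signals: +1 points to state A, -1 points to state B
          n_a, n_b = signals.count(+1), signals.count(-1)
          like_a = q**n_a * (1 - q)**n_b
          like_b = (1 - q)**n_a * q**n_b
          return like_a / (like_a + like_b)

      history = [+1, +1]   # signals inferred from two earlier public choices
      private = [-1]       # own contrarian private signal
      print(posterior_a(history + private))  # 2/3 > 1/2: herding is rational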
  14. By: Luis Enriquez Alvarez
    Abstract: Artificial intelligence risks are multidimensional in nature, as the same risk scenarios may have legal, operational, and financial risk dimensions. Despite the emergence of new AI regulations, the state of the art of artificial intelligence risk management remains highly immature. Despite the appearance of several methodologies and generic criteria, it is rare to find guidelines with real implementation value, considering that the most important issue is customizing artificial intelligence risk metrics and risk models for specific AI risk scenarios. Furthermore, financial departments, legal departments, and governance, risk, and compliance (GRC) teams often remain unaware of many technical aspects of AI systems, for which data scientists and AI engineers emerge as the most appropriate implementers. It is crucial to decompose the problem of artificial intelligence risk into several dimensions: data protection, fairness, accuracy, robustness, and information security. Consequently, the main task is developing adequate metrics and risk models that reduce uncertainty and support informed decisions concerning the risk management of AI systems. The purpose of this paper is to orient AI stakeholders about the depths of AI risk management. Although it is not extremely technical, it requires a basic knowledge of risk management, quantifying uncertainty, the FAIR model, machine learning, large language models, and AI context engineering. The examples presented aim to be very basic and understandable, providing simple ideas that can be developed for specific customized AI environments. There are many issues to solve in AI risk management, and this paper presents a holistic overview of the inter-dependencies of AI risks and how to model them together within risk scenarios.
    Date: 2025–09
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2509.18394
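    The FAIR-style quantification the abstract builds on is typically a frequency-times-severity Monte Carlo; a sketch with illustrative parameters that are not calibrated to any real AI system.
      # Hedged sketch: FAIR-style Monte Carlo for an AI risk scenario.
      import numpy as np

      rng = np.random.default_rng(42)
      n_sims = 100_000
      lam = 2.0                 # expected AI incidents per year (assumed)
      mu, sigma = 10.0, 1.2     # lognormal loss parameters (assumed)

      counts = rng.poisson(lam, n_sims)
      annual = np.array([rng.lognormal(mu, sigma, k).sum() if k else 0.0
                         for k in counts])

      print(f"expected annual loss: {annual.mean():,.0f}")
      print(f"AI VaR at 95%:        {np.quantile(annual, 0.95):,.0f}")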
  15. By: Fabrizio Dimino; Abhinav Arun; Bhaskarjit Sarmah; Stefano Pasquali
    Abstract: Large language models (LLMs) are increasingly being used to extract structured knowledge from unstructured financial text. Although prior studies have explored various extraction methods, there is no universal benchmark or unified evaluation framework for the construction of financial knowledge graphs (KGs). We introduce FinReflectKG - EvalBench, a benchmark and evaluation framework for KG extraction from SEC 10-K filings. Building on the agentic and holistic evaluation principles of FinReflectKG (a financial KG linking audited triples to source chunks from S&P 100 filings and supporting single-pass, multi-pass, and reflection-agent-based extraction modes), EvalBench implements a deterministic commit-then-justify judging protocol with explicit bias controls, mitigating position effects, leniency, verbosity, and world-knowledge reliance. Each candidate triple is evaluated with binary judgments of faithfulness, precision, and relevance, while comprehensiveness is assessed on a three-level ordinal scale (good, partial, bad) at the chunk level. Our findings suggest that, when equipped with explicit bias controls, LLM-as-Judge protocols provide a reliable and cost-efficient alternative to human annotation, while also enabling structured error analysis. Reflection-based extraction emerges as the superior approach, achieving best performance in comprehensiveness, precision, and relevance, while single-pass extraction maintains the highest faithfulness. By aggregating these complementary dimensions, FinReflectKG - EvalBench enables fine-grained benchmarking and bias-aware evaluation, advancing transparency and governance in financial AI applications.
    Date: 2025–10
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2510.05710
  16. By: Fabrizio Dimino; Krati Saxena; Bhaskarjit Sarmah; Stefano Pasquali
    Abstract: Large Language Models are increasingly adopted in financial applications to support investment workflows. However, prior studies have seldom examined how these models reflect biases related to firm size, sector, or financial characteristics, which can significantly impact decision-making. This paper addresses this gap by focusing on representation bias in open-source Qwen models. We propose a balanced round-robin prompting method over approximately 150 U.S. equities, applying constrained decoding and token-logit aggregation to derive firm-level confidence scores across financial contexts. Using statistical tests and variance analysis, we find that firm size and valuation consistently increase model confidence, while risk factors tend to decrease it. Confidence varies significantly across sectors, with the Technology sector showing the greatest variability. When models are prompted for specific financial categories, their confidence rankings best align with fundamental data, moderately with technical signals, and least with growth indicators. These results highlight representation bias in Qwen models and motivate sector-aware calibration and category-conditioned evaluation protocols for safe and fair financial LLM deployment.
    Date: 2025–10
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2510.05702
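    Constrained decoding with token-logit aggregation can be illustrated with Hugging Face transformers: restrict attention to a small label set and renormalize the labels' next-token logits into a confidence score. The model id, prompt, and label tokens below are placeholders, not the paper's setup.
      # Hedged sketch: label-constrained confidence from next-token logits.
      import torch
      from transformers import AutoModelForCausalLM, AutoTokenizer

      model_id = "Qwen/Qwen2.5-0.5B-Instruct"   # stand-in open model
      tok = AutoTokenizer.from_pretrained(model_id)
      model = AutoModelForCausalLM.from_pretrained(model_id)

      prompt = "Confidence in investing in ACME Corp (high or low):"
      inputs = tok(prompt, return_tensors="pt")
      with torch.no_grad():
          logits = model(**inputs).logits[0, -1]    # next-token logits

      hi = tok(" high", add_special_tokens=False).input_ids[0]
      lo = tok(" low", add_special_tokens=False).input_ids[0]
      conf = torch.softmax(logits[[hi, lo]], dim=0)[0].item()
      print(f"confidence(high) = {conf:.3f}")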
  17. By: Kurter, Zeynep O. (University of Warwick; Department of Economics); Bhatti, Balaaj (University of Warwick; Department of Economics)
    Abstract: While artificial intelligence (AI) has become increasingly prevalent, empirical evidence on its impact on firm value is limited. This inaugural UK market study uses event study methodology to assess stock market reactions to AI investment announcements by FTSE 100 companies from 2019 to 2023. Analysing 138 announcements from 53 companies, the research reveals that AI investments have a marginally positive, but statistically insignificant impact of 0.114% on the announcement day, affirmed by both parametric and non-parametric tests. Further subsample analysis shows that high credit rating firms and early adopters experience significantly negative impacts on firm value, indicating investor risk-aversion and tentative evidence of a second-mover advantage. Cross-sectional analysis demonstrates that industry and the type of AI investment critically influence returns, and confirms the size effect with larger firms experiencing more negative returns than smaller ones. Earnings before interest, taxes, depreciation, and amortization (EBITDA) margins and cyber risk ratings, however, do not significantly impact returns. This study advances AI literature by examining market dynamics associated with AI investments, providing a foundation for future research and practical insights for investors and corporate managers aiming to maximize risk-adjusted returns and firm value.
    Keywords: Artificial intelligence, AI, firm value, event study, abnormal returns, United Kingdom
    JEL: G11 G14 O33 M21 L1
    Date: 2025
    URL: https://d.repec.org/n?u=RePEc:wrk:warwec:1581
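    The market-model event study behind these abnormal returns is standard; a sketch on simulated return series, since the authors' estimation window and test statistics may differ.
      # Hedged sketch: market-model abnormal returns around an announcement.
      import numpy as np

      rng = np.random.default_rng(1)
      r_m = rng.normal(0.0003, 0.010, 280)                    # market returns
      r_i = 0.0001 + 1.1 * r_m + rng.normal(0, 0.008, 280)    # firm returns

      est, event = slice(0, 250), slice(250, 280)   # estimation / event window
      beta, alpha = np.polyfit(r_m[est], r_i[est], 1)

      ar = r_i[event] - (alpha + beta * r_m[event])           # abnormal returns
      resid = r_i[est] - (alpha + beta * r_m[est])
      t0 = ar[0] / resid.std(ddof=2)                          # day-0 t-statistic
      print(f"AR(0) = {ar[0]:.4%}, CAR = {ar.sum():.4%}, t = {t0:.2f}")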
  18. By: Lokesh Antony Kadiyala; Amir Mirzaeinia
    Abstract: The stock market is extremely difficult to predict in the short term due to high market volatility, news-driven changes, and the non-linear nature of financial time series. This research proposes a novel framework for improving minute-level prediction accuracy using semantic sentiment scores from ten different large language models (LLMs) combined with minute-interval intraday stock price data. We systematically constructed a time-aligned dataset of Apple Inc. (AAPL) news articles and 1-minute AAPL stock prices for April 4 to May 2, 2025. Sentiment analysis was performed with the DeepSeek-V3, GPT variants, LLaMA, Claude, Gemini, Qwen, and Mistral models through their APIs. Each article obtained sentiment scores from all ten LLMs, which were scaled to a [0, 1] range and combined with prices and technical indicators like RSI, ROC, and Bollinger Band Width. Two state-of-the-art architectures, Reformer and Mamba, were trained separately on the dataset using the sentiment scores produced by each LLM as input. Hyperparameters were optimized with Optuna, and models were evaluated on mean squared error (MSE) over a 3-day evaluation period; Mamba was not only faster but also more accurate than Reformer for every one of the ten LLMs tested. Mamba performed best with LLaMA 3.3-70B, achieving the lowest error of 0.137. While Reformer could capture broader trends within the data, it appeared to oversmooth the sudden changes signaled by the LLM sentiment scores. This study highlights the potential of integrating LLM-based semantic analysis with efficient temporal modeling to enhance real-time financial forecasting.
    Date: 2025–09
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2510.01203
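    A sketch of the feature construction the abstract names: min-max scaling of sentiment to [0, 1] plus RSI, ROC, and Bollinger Band Width. The dataframe columns are assumptions, and the RSI shown is the simple moving-average variant; the authors' exact definitions may differ.
      # Hedged sketch: sentiment scaling and the named technical indicators.
      import pandas as pd

      def add_features(df: pd.DataFrame, n: int = 14) -> pd.DataFrame:
          s = df["sentiment_raw"]                       # assumed raw column
          df["sentiment"] = (s - s.min()) / (s.max() - s.min())

          delta = df["close"].diff()
          gain = delta.clip(lower=0).rolling(n).mean()
          loss = (-delta.clip(upper=0)).rolling(n).mean()
          df["rsi"] = 100 - 100 / (1 + gain / loss)     # SMA-based RSI

          df["roc"] = df["close"].pct_change(n) * 100   # rate of change

          mid = df["close"].rolling(n).mean()
          sd = df["close"].rolling(n).std()
          df["bb_width"] = 4 * sd / mid                 # (upper - lower) / mid
          return df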
  19. By: Xinrui Ruan; Xinwei Ma; Yingfei Wang; Waverly Wei; Jingshen Wang
    Abstract: Randomized experiments or randomized controlled trials (RCTs) are the gold standard for causal inference, yet cost and sample-size constraints limit power. Meanwhile, modern RCTs routinely collect rich, unstructured data that are highly prognostic of outcomes but rarely used in causal analyses. We introduce CALM (Causal Analysis leveraging Language Models), a statistical framework that integrates large language model (LLM) predictions with established causal estimators to increase precision while preserving statistical validity. CALM treats LLM outputs as auxiliary prognostic information and corrects their potential bias via a heterogeneous calibration step that residualizes and optimally reweights predictions. We prove that CALM remains consistent even when LLM predictions are biased and achieves efficiency gains over augmented inverse probability weighting estimators for various causal effects. In addition, CALM includes a few-shot variant that aggregates predictions across randomly sampled demonstration sets. The resulting U-statistic-like predictor restores i.i.d. structure and also mitigates prompt-selection variability. Empirically, in simulations calibrated to a mobile-app depression RCT, CALM delivers lower variance relative to benchmark methods, is effective in zero- and few-shot settings, and remains stable across prompt designs. By principled use of LLMs to harness unstructured data and external knowledge learned during pretraining, CALM provides a practical path to more precise causal analyses in RCTs.
    Date: 2025–10
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2510.05545
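    The core calibration idea (use the LLM prediction as an auxiliary covariate, residualize, and reweight so randomization keeps the estimate unbiased) can be sketched as a Lin-style regression adjustment on simulated data; this is a simplified stand-in for CALM's full estimator, not the authors' implementation.
      # Hedged sketch: variance reduction from an LLM outcome prediction m.
      import numpy as np

      rng = np.random.default_rng(7)
      n = 2_000
      t = rng.integers(0, 2, n)               # randomized treatment
      m = rng.normal(0, 1, n)                 # LLM prediction (may be biased)
      y = 0.5 * t + 0.8 * m + rng.normal(0, 1, n)

      dim = y[t == 1].mean() - y[t == 0].mean()      # difference in means

      mc = m - m.mean()                              # center the prediction
      b1 = np.polyfit(mc[t == 1], y[t == 1], 1)[0]   # arm-specific slopes
      b0 = np.polyfit(mc[t == 0], y[t == 0], 1)[0]
      adj = y - np.where(t == 1, b1, b0) * mc        # residualized outcomes
      tau = adj[t == 1].mean() - adj[t == 0].mean()
      print(f"diff-in-means: {dim:.3f}, calibrated: {tau:.3f}")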

This nep-ain issue is ©2025 by Ben Greiner. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at https://nep.repec.org. For comments please write to the director of NEP, Marco Novarese at <director@nep.repec.org>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.