nep-ain New Economics Papers
on Artificial Intelligence
Issue of 2026–04–13
fifteen papers chosen by
Ben Greiner, Wirtschaftsuniversität Wien


  1. Using large language models as a source of human behavioral data in social science experiments By van Loon, Austin; Kanopka, Klint
  2. Debiasing LLMs by Fine-tuning By Zhenyu Gao; Wenxi Jiang; Yutong Yan
  3. Guidance Over Adoption: Experimental Evidence on AI-Assisted Learning By Gallegos, Sebastian
  4. Measuring What Cannot Be Surveyed: LLMs as Instruments for Latent Cognitive Variables in Labor Economics By Cristian Espinal Maya
  5. AI Adoption and Firms’ Job-Posting Behavior By Jessica Liu; Douglas A. Webber
  6. Same Storm, Different Boats: Generative AI and the Age Gradient in Hiring By Lodefalk, Magnus; Löthman, Lydia; Koch, Michael; Engberg, Erik
  7. Bounded by Risk, Not Capability: Quantifying AI Occupational Substitution Rates via a Tech-Risk Dual-Factor Model By Shuyao Gao; Minghao Huang
  8. AI and Coder Employment: Compiling the Evidence By Leland D. Crane; Paul E. Soto
  9. How AI Aggregation Affects Knowledge By Daron Acemoglu; Tianyi Lin; Asuman Ozdaglar; James Siderius
  10. The Ideation Bottleneck: Decomposing the Quality Gap Between AI-Generated and Human Economics Research By Ning Li
  11. AI Worker Management technologies in traditional industries By Claudia Collodoro; Lucrezia Fanti; Jacopo Staccioli; Maria Enrica Virgillito
  12. AI Patents in the United States and China: Measurement, Organization, and Knowledge Flows By Hanming Fang; Xian Gu; Hanyin Yan; Wu Zhu
  13. Artificial Intelligence and Systemic Risk: A Unified Model of Performative Prediction, Algorithmic Herding, and Cognitive Dependency in Financial Markets By Shuchen Meng; Xupeng Chen
  14. Prediction Arena: Benchmarking AI Models on Real-World Prediction Markets By Jaden Zhang; Gardenia Liu; Oliver Johansson; Hileamlak Yitayew; Kamryn Ohly; Grace Li
  15. Processing Power: The Effect of Data Centers on Wholesale Electricity Markets By Owen Kay; Robert Reaser; Reid Taylor

  1. By: van Loon, Austin; Kanopka, Klint (New York University)
    Abstract: Large language models (LLMs) have prompted proposals to replace human subjects in social science experiments with simulated responses. Empirical evaluations suggest that this practice, often called silicon sampling, can sometimes approximate human behavior but is unreliable. We delineate where this approach may still provide value and where it may not, but primarily study an alternative: one in which model-based predictions are used not as substitutes for human data, but as auxiliary measurements within randomized experiments. We formalize the inference of causal estimands from mixed-subjects randomized controlled trials, in which outcomes are observed for a subset of units while predictions are available for all units. Under transparent design conditions, we derive a family of estimators that remain unbiased for the average treatment effect in finite samples while exploiting predictions to reduce variance. We characterize when prediction-powered, calibration-based, arm-specifically tuned, and difference-in-predictions estimators improve precision, and we provide a software package that operationalizes these results and helps researchers jointly select estimators and allocate budgets between human data collection and prediction generation. Together, our results show how generative artificial intelligence can improve experimental social science without compromising scientific validity.
    Date: 2026–04–03
    URL: https://d.repec.org/n?u=RePEc:osf:socarx:y74mu_v1
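    For intuition, a minimal Python sketch of the mixed-subjects logic described above: predictions enter through a full-sample contrast, and the human-labeled subset supplies a bias correction that preserves unbiasedness under randomization. The unweighted form and all names here are illustrative assumptions, not the paper's estimator family or software package.

      import numpy as np

      def pp_ate(y_lab, t_lab, f_lab, f_all, t_all):
          """Prediction-powered ATE sketch: contrast of mean predictions
          on the full sample, plus a correction from labeled units."""
          pred_gap = f_all[t_all == 1].mean() - f_all[t_all == 0].mean()
          resid = y_lab - f_lab  # prediction errors where outcomes exist
          return pred_gap + (resid[t_lab == 1].mean() - resid[t_lab == 0].mean())

      # Toy usage: predictions for all 1,000 units, outcomes for 200.
      rng = np.random.default_rng(0)
      t = rng.integers(0, 2, 1000)
      f = 0.5 * t + rng.normal(0, 1, 1000)        # model predictions
      lab = rng.choice(1000, 200, replace=False)  # human-labeled subset
      y = f[lab] + rng.normal(0, 0.3, 200)        # observed outcomes
      print(pp_ate(y, t[lab], f[lab], f, t))      # close to the true 0.5

    The better the predictions track the outcomes, the smaller the variance of the correction term, which is where the precision gain comes from.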
  2. By: Zhenyu Gao; Wenxi Jiang; Yutong Yan
    Abstract: Prior research shows that large language models (LLMs) exhibit systematic extrapolation bias when forming predictions from both experimental and real-world data, and that prompt-based approaches appear limited in alleviating this bias. We propose a supervised fine-tuning (SFT) approach that uses Low-Rank Adaptation (LoRA) to train off-the-shelf LLMs on instruction datasets constructed from rational benchmark forecasts. By intervening at the parameter level, SFT changes how LLMs map observed information into forecasts and thereby mitigates extrapolation bias. We evaluate the fine-tuned model in two settings: controlled forecasting experiments and cross-sectional stock return prediction. In both settings, fine-tuning corrects the extrapolative bias out-of-sample, establishing a low-cost and generalizable method for debiasing LLMs.
    Date: 2026–04
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2604.02921
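    For concreteness, a minimal LoRA fine-tuning scaffold with the Hugging Face peft library is sketched below; the base model, rank, and target modules are illustrative assumptions, and the instruction dataset of rational benchmark forecasts and the training loop are omitted.

      from transformers import AutoModelForCausalLM
      from peft import LoraConfig, get_peft_model

      base = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in base LLM
      cfg = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                       target_modules=["c_attn"], task_type="CAUSAL_LM")
      model = get_peft_model(base, cfg)
      model.print_trainable_parameters()  # only the low-rank adapters train

    Because only the adapter matrices are updated, the intervention is cheap relative to full fine-tuning, which is what makes the debiasing approach low-cost.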
  3. By: Gallegos, Sebastian (Universidad Adolfo Ibañez)
    Abstract: This paper estimates the causal effect of a large language model–based study assistant on student behavior and learning outcomes in a natural field setting with real academic stakes. I design and deploy a course-specific AI assistant (GPT-UAI) for undergraduate econometrics and evaluate it through two randomized interventions implemented across seven coordinated course sections at a selective university in Chile. The first intervention targets the extensive margin of use, encouraging GPT-UAI adoption prior to the midterm exam. The encouragement raises awareness and reported usage of GPT-UAI, but does not change its perceived value and does not improve midterm performance. The second intervention targets use at the intensive margin, providing guidance on learning-oriented usage for the final exam. Guidance shifts interactions with GPT-UAI toward tutor-style engagement, increases perceived usefulness by 0.38 standard deviations, improves final-exam performance by 0.21 standard deviations, and raises the probability of earning a passing exam grade by 12 percentage points. The findings suggest that learning gains arise less from adoption than from guiding how students use course-specific AI assistants.
    Keywords: generative AI, large language models, higher education, field experiments, randomized controlled trials, student learning, human capital, AI-assisted learning, tutoring, technology in education
    JEL: I23 C93 O33 D83
    Date: 2026–04
    URL: https://d.repec.org/n?u=RePEc:iza:izadps:dp18513
  4. By: Cristian Espinal Maya
    Abstract: This paper establishes the theoretical and practical foundations for using Large Language Models (LLMs) as measurement instruments for latent economic variables, specifically variables that describe the cognitive content of occupational tasks at a level of granularity not achievable with existing survey instruments. I formalize four conditions under which LLM-generated scores constitute valid instruments: semantic exogeneity, construct relevance, monotonicity, and model invariance. I then apply this framework to the Augmented Human Capital Index (AHC_o), constructed from 18,796 O*NET task statements scored by Claude Haiku 4.5, and validated against six existing AI exposure indices. The index shows strong convergent validity (r = 0.85 with Eloundou GPT-gamma, r = 0.79 with Felten AIOE) and discriminant validity. Principal component analysis confirms that AI-related occupational measures span two distinct dimensions: augmentation and substitution. Inter-rater reliability across two LLM models (n = 3,666 paired scores) yields Pearson r = 0.76 and Krippendorff's alpha = 0.71. Prompt sensitivity analysis across four alternative framings shows that task-level rankings are robust. Obviously Related Instrumental Variables (ORIV) estimation recovers coefficients 25% larger than OLS, consistent with classical measurement error attenuation. The methodology generalizes beyond labor economics to any domain where semantic content must be quantified at scale.
    Date: 2026–04
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2604.02403
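    The ORIV step can be sketched compactly: given two noisy LLM scores of the same latent variable, each score instruments the other on a stacked sample, undoing the attenuation that biases OLS toward zero. This compressed just-identified form with a common intercept is an illustration, not the paper's implementation.

      import numpy as np

      def oriv_slope(y, x1, x2):
          """Obviously Related IV: stack the two noisy measures and
          instrument each with the other via 2SLS."""
          n = len(y)
          ys = np.concatenate([y, y])
          X = np.column_stack([np.ones(2 * n), np.concatenate([x1, x2])])
          Z = np.column_stack([np.ones(2 * n), np.concatenate([x2, x1])])
          return np.linalg.solve(Z.T @ X, Z.T @ ys)[1]  # slope estimate

    Because the two scores come from independent model runs, their measurement errors are plausibly uncorrelated, so each is a valid instrument for the other; this is why the IV coefficient exceeds OLS by roughly the attenuation factor.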
  5. By: Jessica Liu; Douglas A. Webber
    Abstract: Since ChatGPT was launched in November 2022, usage of generative artificial intelligence (AI) has soared. More than half of the U.S. working-age population has used generative AI, according to the Real-Time Population Survey from August 2025 (Bick, Blandin, and Deming 2025).
    Date: 2026–03–27
    URL: https://d.repec.org/n?u=RePEc:fip:fedgfn:102994
  6. By: Lodefalk, Magnus (The Ratio Institute); Löthman, Lydia (The Ratio Institute); Koch, Michael (The Ratio Institute); Engberg, Erik (The Ratio Institute)
    Abstract: We show that the age composition of employment within Swedish employers shifts after the arrival of generative AI, with no corresponding reduction in aggregate labour demand. Using 4.6 million job advertisements from Sweden's largest recruitment platform, we find that the broad decline in postings since 2022 aligns with monetary tightening rather than AI, exploiting Sweden's seven-month gap between the Riksbank's first rate hike and the launch of ChatGPT as a timing test. We then use full-population employer–employee register data and an employer-level difference-in-differences design to estimate how AI exposure affects employment composition across six age groups. An event study documents an accelerating decline in employment of 22–25-year-olds in high-AI-exposure occupations, reaching 5.5 per cent by early 2025 relative to less exposed occupations within the same employers, while employment of workers over 50 rose by 1.3 per cent. The widening age gradient suggests that generative AI reshapes hiring composition rather than aggregate demand, with the adjustment burden falling disproportionately on entry-level workers.
    Keywords: Generative artificial intelligence; Job postings; Labour demand; Employment composition; Monetary policy
    JEL: J23 J24 O33
    Date: 2026–03–16
    URL: https://d.repec.org/n?u=RePEc:hhs:ratioi:0388
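    A stylized version of the employer-level difference-in-differences design, on synthetic data, might look as follows; the variable names, panel structure, and effect size are invented for illustration and bear no relation to the Swedish register data.

      import numpy as np, pandas as pd
      import statsmodels.formula.api as smf

      rng = np.random.default_rng(0)
      df = pd.DataFrame({
          "firm": np.repeat(np.arange(200), 12),
          "quarter": np.tile(np.arange(12), 200),
          "exposure": np.repeat(rng.uniform(0, 1, 200), 12)})
      df["post"] = (df["quarter"] >= 6).astype(int)  # generative AI arrival
      df["young_share"] = (0.2 - 0.03 * df["post"] * df["exposure"]
                           + rng.normal(0, 0.02, len(df)))

      # Two-way fixed effects: the exposure x post coefficient is the DiD
      # estimate of the shift in the young-worker employment share.
      res = smf.ols("young_share ~ post:exposure + C(firm) + C(quarter)",
                    data=df).fit(cov_type="cluster",
                                 cov_kwds={"groups": df["firm"]})
      print(res.params["post:exposure"])  # recovers roughly -0.03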
  7. By: Shuyao Gao (aSSIST University, Seoul, South Korea); Minghao Huang (aSSIST University, Seoul, South Korea)
    Abstract: The deployment of Large Language Models (LLMs) has ignited concerns about technological unemployment. Existing task-based evaluations predominantly measure theoretical "exposure" to AI capabilities, ignoring critical frictions of real-world commercial adoption: liability, compliance, and physical safety. We argue that occupations are not eradicated instantaneously but are encroached upon gradually, one atomic action at a time. We introduce a Tech-Risk Dual-Factor Model to re-evaluate this process. By deconstructing 923 occupations into 2,087 Detailed Work Activities (DWAs), we utilize a multi-agent LLM ensemble to score both technical feasibility and business risk. Through variance-based Human-in-the-Loop (HITL) validation with an expert panel, we demonstrate a profound cognitive gap: isolated algorithmic probabilities fail to encapsulate the "institutional premium" imposed by experts bounded by professional liability. Applying a strictly algorithmic baseline via mathematical bottleneck aggregation, we calculate Relative Occupational Automation Indices ($OAI$) for the U.S. labor market. Our findings challenge the traditional Routine-Biased Technological Change (RBTC) hypothesis. Non-routine cognitive roles highly dependent on symbolic manipulation (e.g., Data Scientists) face unprecedented exposure ($OAI \approx 0.70$). Conversely, unstructured physical trades and high-stakes caretaking roles exhibit absolute resilience, quantifying a profound "Cognitive Risk Asymmetry." We hypothesize the emergent necessity of a "Compliance Premium," indicating that wage resilience is increasingly tied to risk-absorption capacity. We frame these findings as a cross-sectional diagnostic of systemic vulnerability, establishing a foundation for subsequent Computable General Equilibrium (CGE) econometric modeling involving dynamic wage elasticity and structural labor reallocation.
    Date: 2026–04
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2604.04464
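    The bottleneck aggregation idea lends itself to a two-line sketch; the specific combination rule below (feasibility discounted by risk, occupation capped by its weakest activity) is an assumed reading of the abstract, not the authors' exact formula.

      import numpy as np

      def oai(feasibility, risk):
          """Occupation automation index under bottleneck aggregation:
          each activity's automatable share is feasibility x (1 - risk),
          and the least automatable critical activity binds."""
          eff = np.asarray(feasibility) * (1.0 - np.asarray(risk))
          return float(eff.min())

      # One high-risk activity keeps the whole occupation resilient:
      print(oai([0.9, 0.8, 0.7], [0.1, 0.2, 0.9]))  # -> 0.07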
  8. By: Leland D. Crane; Paul E. Soto
    Abstract: We evaluate whether LLMs have had any discernible impact on the aggregate labor market so far. We focus on occupations that are computer programming-intensive, motivated by data showing that coding is one of the most LLM-exposed tasks. Linking O*NET to the CPS, we find that aggregate employment of coders has decelerated sharply since the introduction of ChatGPT. Using a novel control variable for industry-level shocks, we show that the deceleration is not attributable to the exposure of coders to slowing industries, suggesting instead that coders experienced an occupation-specific shock around the introduction of ChatGPT. Coder employment has continued to grow in recent years, though much more slowly than it did pre-2022. We validate the industry-level control variable by examining historical examples of occupations that experienced either occupation-specific or industry-level shocks. We also provide statistics on the agreement rates between different measures of AI exposure.
    Keywords: Labor demand; Machine learning; Shocks
    JEL: J23 J24 O33
    Date: 2026–03–23
    URL: https://d.repec.org/n?u=RePEc:fip:fedgfe:102997
  9. By: Daron Acemoglu; Tianyi Lin; Asuman Ozdaglar; James Siderius
    Abstract: Artificial intelligence (AI) changes social learning when aggregated outputs become training data for future predictions. To study this, we extend the DeGroot model by introducing an AI aggregator that trains on population beliefs and feeds synthesized signals back to agents. We define the learning gap as the deviation of long-run beliefs from the efficient benchmark, allowing us to capture how AI aggregation affects learning. Our main result identifies a threshold in the speed of updating: when the aggregator updates too quickly, there is no positive-measure set of training weights that robustly improves learning across a broad class of environments, whereas such weights exist when updating is sufficiently slow. We then compare global and local architectures. Local aggregators trained on proximate or topic-specific data robustly improve learning in all environments. Consequently, replacing specialized local aggregators with a single global aggregator worsens learning in at least one dimension of the state.
    JEL: D80 D83 D85
    Date: 2026–04
    URL: https://d.repec.org/n?u=RePEc:nbr:nberwo:35036
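    A toy simulation conveys the feedback loop; the update rule, parameter names, and the use of the true state as the efficient benchmark are illustrative assumptions rather than the paper's model.

      import numpy as np

      rng = np.random.default_rng(1)
      n, T, eta, gamma = 50, 500, 0.5, 0.3   # eta: aggregator update speed
      W = rng.dirichlet(np.ones(n), size=n)  # row-stochastic trust matrix
      w = np.full(n, 1 / n)                  # aggregator training weights
      theta = 1.0                            # true state
      x = theta + rng.normal(0, 1, n)        # initial noisy beliefs
      a = w @ x                              # aggregator's first output
      for _ in range(T):
          a = (1 - eta) * a + eta * (w @ x)      # aggregator retrains on beliefs
          x = (1 - gamma) * (W @ x) + gamma * a  # agents mix in the AI signal
      print(abs(x.mean() - theta))  # stand-in for the learning gap

    Sweeping eta in this toy lets one explore the speed-of-updating threshold that the main result formalizes.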
  10. By: Ning Li
    Abstract: Autonomous AI systems can now generate complete economics research papers, but they substantially underperform human-authored publications in head-to-head comparisons. This paper decomposes the quality gap into two independent components: research idea quality and execution quality. We evaluate idea quality with a two-model ensemble of fine-tuned language models trained on publication decisions (Gong, Li, and Zhou, 2026), and execution quality with a comprehensive six-dimension rubric assessed by Gemini 3.1 Flash Lite (the same model family used as the APE tournament judge, ensuring methodological consistency). We analyze 953 economics papers: 912 AI-generated papers from the APE project and 41 human papers published in the American Economic Review and AEJ: Economic Policy. The idea quality gap is large (Cohen's d = 2.23, p
    Date: 2026–04
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2604.03338
  11. By: Claudia Collodoro (Dipartimento di Politica Economica, DISCE, Università Cattolica del Sacro Cuore, Milano, Italy); Lucrezia Fanti (Dipartimento di Politica Economica, DISCE, Università Cattolica del Sacro Cuore, Milano, Italy - Istituto di Economia, Scuola Superiore Sant’Anna, Pisa, Italy); Jacopo Staccioli (Dipartimento di Politica Economica, DISCE, Università Cattolica del Sacro Cuore, Milano, Italy - Istituto di Economia, Scuola Superiore Sant’Anna, Pisa, Italy); Maria Enrica Virgillito (Dipartimento di Politica Economica, DISCE, Università Cattolica del Sacro Cuore, Milano, Italy - Istituto di Economia, Scuola Superiore Sant’Anna, Pisa, Italy)
    Abstract: This work provides a comprehensive large-scale analysis of artificial intelligence-based worker management (AIWM) systems from an industry-wide exposure perspective focusing on traditional industries. We begin by examining the knowledge production underlying these workforce management tools and leverage patent technology classifications to identify their dynamics and specific features. For this purpose, we use patent data retrieved from Orbis Intellectual Property covering the years 1975 to 2022, considering patents filed with both the EPO and the USPTO. Furthermore, to identify patents related to AIWM heuristics, we retrieve their full text from Google Patents and conduct a textual analysis using a dependency parsing algorithm. Finally, using the dictionary of human tasks provided by O*NET, we construct a measure of exposure to AIWM systems for individual human tasks and occupations. Linking the technological and labour market domains, we find that the professions most exposed to AIWM systems are those at the top of organisational hierarchies.
    Keywords: Artificial Intelligence Worker Management, Sector-level Analysis, Patenting Activity, Techno-organisational Change
    JEL: O14 O33
    Date: 2026–01
    URL: https://d.repec.org/n?u=RePEc:ctc:serie5:dipe0056
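    As a concrete example of the text-mining step, a dependency parse can extract the verb-object pairs that characterize what a patented system does to workers or tasks; spaCy is used here as a stand-in, and matching such pairs to O*NET task statements is an assumed sketch of the exposure construction, not the authors' algorithm.

      import spacy

      nlp = spacy.load("en_core_web_sm")
      doc = nlp("The system monitors employee activity and schedules "
                "shifts based on predicted demand.")
      # Direct objects and their governing verbs, lemmatized:
      pairs = [(tok.head.lemma_, tok.lemma_)
               for tok in doc if tok.dep_ == "dobj"]
      print(pairs)  # [('monitor', 'activity'), ('schedule', 'shift')]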
  12. By: Hanming Fang; Xian Gu; Hanyin Yan; Wu Zhu
    Abstract: We develop a high-precision classifier to measure artificial intelligence (AI) patents by fine-tuning PatentSBERTa on manually labeled data from the USPTO’s AI Patent Dataset. Our classifier substantially improves the existing USPTO approach, achieving 97.0% precision, 91.3% recall, and a 94.0% F1 score, and it generalizes well to Chinese patents based on citation and lexical validation. Applying it to granted U.S. patents (1976–2023) and Chinese patents (2010–2023), we document rapid growth in AI patenting in both countries and broad convergence in AI patenting intensity and subfield composition, even as China surpasses the United States in recent annual patent counts. The organization of AI innovation nevertheless differs sharply: U.S. AI patenting is concentrated among large private incumbents and established hubs, whereas Chinese AI patenting is more geographically diffuse and institutionally diverse, with larger roles for universities and state-owned enterprises. For listed firms, AI patents command a robust market-value premium in both countries. Cross-border citations show continued technological interdependence rather than decoupling, with Chinese AI inventors relying more heavily on U.S. frontier knowledge than vice versa.
    JEL: C55 G14 O31 O33 O34 O57
    Date: 2026–04
    URL: https://d.repec.org/n?u=RePEc:nbr:nberwo:35022
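    For reference, the headline classifier metrics are the standard binary-classification ones; a small sketch with illustrative labels (not the paper's data):

      from sklearn.metrics import precision_recall_fscore_support

      y_true = [1, 1, 1, 0, 0, 0, 1, 0]   # hand-labeled AI / non-AI patents
      y_pred = [1, 1, 0, 0, 0, 1, 1, 0]   # classifier output
      p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred,
                                                    average="binary")
      print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")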
  13. By: Shuchen Meng; Xupeng Chen
    Abstract: We develop a unified model in which AI adoption in financial markets generates systemic risk through three mutually reinforcing channels: performative prediction, algorithmic herding, and cognitive dependency. Within an extended rational expectations framework with endogenous adoption, we derive an equilibrium systemic risk coupling $r(\phi) = \phi\rho\beta/\lambda'(\phi)$, where $\phi$ is the AI adoption share, $\rho$ the algorithmic signal correlation, $\beta$ the performative feedback intensity, and $\lambda'(\phi)$ the endogenous effective price impact. Because $\lambda'(\phi)$ is decreasing in $\phi$, the coupling is convex in adoption, implying that the systemic risk multiplier $M = (1 - r)^{-1}$ grows superlinearly as AI penetration increases. The model is developed in three layers. First, endogenous fragility: market depth is decreasing and convex in AI adoption. Second, embedding the convex coupling within a supermodular adoption game produces a saddle-node bifurcation into an algorithmic monoculture. Third, cognitive dependency as an endogenous state variable yields an impossibility theorem (hysteresis requires dynamics beyond static frameworks) and a channel necessity theorem (each channel is individually necessary). Empirical validation uses the complete universe of SEC Form 13F filings (99.5 million holdings, 10,957 institutional managers, 2013–2024) with a Bartik shift-share instrument (first-stage $F = 22.7$). The model implies tail-loss amplification of 18–54%, economically significant relative to Basel III countercyclical buffers.
    Date: 2026–03
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2604.03272
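    The superlinear multiplier is easy to see numerically; the functional form assumed below for the effective price impact, $\lambda'(\phi) = \lambda_0(1 - \phi)$, is an illustration only (the abstract requires just that it decrease in $\phi$), and the parameter values are arbitrary.

      # Coupling r(phi) = phi*rho*beta / lambda'(phi) and multiplier
      # M = 1/(1 - r), evaluated at increasing adoption shares.
      rho, beta, lam0 = 0.6, 0.5, 2.0
      for phi in (0.2, 0.5, 0.8):
          lam_prime = lam0 * (1 - phi)      # price impact falls with adoption
          r = phi * rho * beta / lam_prime
          M = 1 / (1 - r)
          print(f"phi={phi:.1f}  r={r:.3f}  M={M:.2f}")

    Under these values the multiplier climbs from about 1.04 at phi = 0.2 to 2.5 at phi = 0.8, illustrating the convexity the model derives.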
  14. By: Jaden Zhang; Gardenia Liu; Oliver Johansson; Hileamlak Yitayew; Kamryn Ohly; Grace Li
    Abstract: We introduce Prediction Arena, a benchmark for evaluating AI models' predictive accuracy and decision-making by enabling them to trade autonomously on live prediction markets with real capital. Unlike synthetic benchmarks, Prediction Arena tests models in environments where trades execute on actual exchanges (Kalshi and Polymarket), providing objective ground truth that cannot be gamed or overfitted. Each model operates as an independent agent starting with $10,000, making autonomous decisions every 15–45 minutes. Over a 57-day longitudinal evaluation (January 12 to March 9, 2026), we track two cohorts: six frontier models in live trading (Cohort 1, full period) and four next-generation models in paper trading (Cohort 2, 3-day preliminary). For Cohort 1, final Kalshi returns range from -16.0% to -30.8%. Our analysis identifies a clear performance hierarchy: initial prediction accuracy and the ability to capitalize on correct predictions are the main drivers, while research volume shows no correlation with outcomes. A striking cross-platform contrast emerges from parallel Polymarket live trading: Cohort 1 models averaged only -1.1% on Polymarket vs. -22.6% on Kalshi, with grok-4-20-checkpoint achieving a 71.4% settlement win rate, the highest across any platform or cohort. gemini-3.1-pro-preview (Cohort 2), which executed zero trades on Kalshi, achieved +6.02% on Polymarket in 3 days, the best return of any model across either cohort, demonstrating that platform design has a profound effect on which models succeed. Beyond performance, we analyze computational efficiency (token usage, cycle time), settlement accuracy, exit patterns, and market preferences, providing a comprehensive view of how frontier models behave under real financial pressure.
    Date: 2026–03
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2604.07355
  15. By: Owen Kay; Robert Reaser; Reid Taylor
    Abstract: Artificial-intelligence-driven data centers are reversing two decades of flat U.S. electricity demand and have raised questions about how this growth will affect electricity prices. We quantify this effect using an hourly, unit-level least-cost dispatch model covering wholesale electricity markets in the continental United States. We find that existing data centers have already increased wholesale prices by 3 to 5% on average nationwide, with substantially larger effects in regions hosting major data center corridors. Extending the model through 2028, we show that if proposed construction proceeds under high-utilization scenarios, wholesale prices could rise dramatically, by roughly 50%, while more moderate build-out yields smaller (around 20%) but still meaningful effects. Impacts vary with utilization and build-out assumptions. Finally, we use the model to address several policy discussions, including optimal data center siting decisions and renewable build-out uncertainty.
    Keywords: electricity prices; energy; data centers; artificial intelligence
    JEL: L94 P18 Q41 Q42 Q48
    Date: 2026–03–20
    URL: https://d.repec.org/n?u=RePEc:fip:feddwp:102959
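    The dispatch logic driving these price effects can be caricatured in a few lines: generators are stacked by marginal cost and the marginal unit sets the wholesale price, so added data-center load walks the market up the supply curve. All costs and capacities below are invented for illustration and are far simpler than the paper's hourly, unit-level model.

      import numpy as np

      def clearing_price(load, costs, caps):
          """Merit-order dispatch: serve load from cheapest units up;
          the last unit needed sets the price."""
          order = np.argsort(costs)
          served = 0.0
          for i in order:
              served += caps[i]
              if served >= load:
                  return costs[i]
          raise ValueError("load exceeds total capacity")

      costs = np.array([10.0, 25.0, 40.0, 80.0])  # $/MWh
      caps = np.array([50.0, 30.0, 30.0, 20.0])   # MW
      print(clearing_price(90.0, costs, caps))         # -> 40.0
      print(clearing_price(90.0 + 25.0, costs, caps))  # added load -> 80.0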

This nep-ain issue is ©2026 by Ben Greiner. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at https://nep.repec.org. For comments please write to the director of NEP, Marco Novarese at <director@nep.repec.org>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the Griffith Business School of Griffith University in Australia.