nep-ain 2026-03-09 papers

on Artificial Intelligence

Issue of 2026–03–09
sixteen papers chosen by
Ben Greiner, Wirtschaftsuniversität Wien

When Algorithms Rate Performance: Do Large Language Models Replicate Human Evaluation Biases? By Rilke, Rainer; Sliwka, Dirk
The Worth of a “Wo”: Gender Bias in Financial Advice from LLMs By Foltyn, Richard; Olsson, Jonna
How does AI distribute the pie? Large Language Models and the Ultimatum Game By Douglas K.G. Araujo; Harald Uhlig
Does AI Cheapen Talk? Theory and Evidence From Global Entrepreneurship and Hiring By Bo Cowgill; Pablo Hernandez-Lagos; Nataliya Langburd Wright
A Bayesian Framework for Human-AI Collaboration: Complementarity and Correlation Neglect By Saurabh Amin; Amine Bennouna; Daniel Huttenlocher; Dingwen Kong; Liang Lyu; Asuman Ozdaglar
LemonadeBench: Evaluating the Economic Intuition of Large Language Models in Simple Markets By Aidan Vyas
Generative AI and Career Choices By Christian Gschwendt; Martina Viarengo; Thea S. Zoellner
Janus-Faced Technological Progress and the Arms Race in the Education of Humans and Chatbots By Wolfgang Kuhle
AI Investment and Economic Growth: Exploring the Divergence of Equilibria By Nobuyasu Suzusho
Generative AI, Productivity and the Future of Work By Greg Cancelada
The Global Trade Effects of the AI Infrastructure Boom By Francois de Soyres; Alex Haag; Mike Liu; Eva Van Leemput
How Effectively Can Current LLMs Analyze Macrofinancial Issues? By Paola Ganum; Tohid Atashbar
Could Large Language Models work as Post-hoc Explainability Tools in Credit Risk Models? By Wenxi Geng; Dingyuan Liu; Liya Li; Yiqing Wang
AlphaForgeBench: Benchmarking End-to-End Trading Strategy Design with Large Language Models By Wentao Zhang; Mingxuan Zhao; Jincheng Gao; Jieshun You; Huaiyu Jia; Yilei Zhao; Bo An; Shuo Sun
Measuring Online Media Ideology with Large Language Models and "Multi-Cue Classification" By da Silva, Lucas Paulo
"Artificial Intelligence: Friend, Foe, Fraud" By L. Randall Wray

When Algorithms Rate Performance: Do Large Language Models Replicate Human Evaluation Biases?

By:	Rilke, Rainer (WHU - Otto Beisheim School of Management); Sliwka, Dirk (University of Cologne)
Abstract:	A large body of research across management, psychology, accounting, and economics shows that subjective performance evaluations are systematically biased: ratings cluster near the midpoint of scales and are often excessively lenient. As organizations increasingly adopt large language models (LLMs) for evaluative tasks, little is known about how these systems perform when assessing human performance. We document that, in the absence of clear objective standards and when individuals are rated independently, LLMs reproduce the familiar patterns of human raters. However, LLMs generate greater dispersion and accuracy when evaluating multiple individuals simultaneously. With noisy but objective performance signals, LLMs provide substantially more accurate evaluations than human raters, as they (i) are less subject to biases arising from concern for the evaluated employee and (ii) make fewer mistakes in information processing, closely approximating rational Bayesian benchmarks.
Keywords:	performance evaluation, large language models, signal objectivity, algorithmic judgment, Gen-AI
JEL:	J24 J28 M12 M53
Date:	2026–02
URL:	https://d.repec.org/n?u=RePEc:iza:izadps:dp18371

The Worth of a “Wo”: Gender Bias in Financial Advice from LLMs

By:	Foltyn, Richard (Dept. of Economics, Norwegian School of Economics and Business Administration); Olsson, Jonna (Dept. of Economics, Norwegian School of Economics and Business Administration)
Abstract:	Do large language models (LLMs) provide gender-neutral financial advice? We answer this question by prompting 33 widely used LLMs from five vendors, varying only a single word in otherwise identical prompts: “man” versus “woman.” We find that women are advised to allocate 1.8 percentage points less to equity funds than men; this gap persists across vendors, model generations, and model complexity. Providing richer investor information attenuates but does not entirely eliminate the gender gap. Since even modest allocation differences imply persistent return differentials, algorithmic financial advice can shape wealth accumulation across demographic groups.
Keywords:	Algorithmic bias; Gender bias; Large Language Models; Portfolio allocation
JEL:	C01 G11 J16
Date:	2026–02–27
URL:	https://d.repec.org/n?u=RePEc:hhs:nhheco:2026_004

How does AI distribute the pie? Large Language Models and the Ultimatum Game

By:	Douglas K.G. Araujo (Banco Central do Brasil); Harald Uhlig (University of Chicago, CEPR and NBER)
Abstract:	As Large Language Models (LLMs) are increasingly tasked with autonomous decision-making, understanding their behavior in strategic settings is crucial. We investigate the choices of various LLMs in the Ultimatum Game, a setting where human behavior notably deviates from theoretical rationality. We conduct experiments varying the stake size and the nature of the opponent (Human vs. AI) across both Proposer and Responder roles. Three key results emerge. First, LLM behavior is heterogeneous but predictable when conditioning on stake size and player types. Second, while some models approximate the rational benchmark and others mimic human social preferences, a distinct “altruistic” mode emerges where LLMs propose hyper-fair distributions (greater than 50%). Third, LLM Proposers forgo a large share of total payoff, and an even larger share when the Responder is human. These findings highlight the need for careful testing before deploying AI agents in economic settings.
Keywords:	Ultimatum Game, LLM, AI Agents, Behavioral Economics, Algorithmic Decision Making
JEL:	C70 C90 D91
Date:	2026
URL:	https://d.repec.org/n?u=RePEc:bfi:wpaper:2026-29

Does AI Cheapen Talk? Theory and Evidence From Global Entrepreneurship and Hiring

By:	Bo Cowgill; Pablo Hernandez-Lagos; Nataliya Langburd Wright
Abstract:	Screening human capital based on signals such as job applications or entrepreneurial pitches is crucial for organizations. Signals are often informative insofar as they require differential knowledge and effort to produce. Generative AI (GAI) complicates screening by lowering the cost of producing impressive signals. We model the informational effects of GAI, showing that applicants' access to GAI can increase - but also decrease - an evaluator's screening mistakes. This result depends on how GAI affects experts' signals compared to non-experts'. Using experiments in hiring and startup investing, we estimate that senders' access to GAI (ChatGPT) lowers screening accuracy by 4-9% for employers and startup investors. Consistent with our model, senders' access to GAI also improves screening accuracy in some settings - in our case, among senders from non-English-speaking countries. These results show that GAI can profoundly shape screening accuracy.
Keywords:	screening, artificial intelligence, entrepreneurship, human capital
JEL:	D82 M51 L26 D83 O33 M13
Date:	2026
URL:	https://d.repec.org/n?u=RePEc:ces:ceswps:_12508

A Bayesian Framework for Human-AI Collaboration: Complementarity and Correlation Neglect

By:	Saurabh Amin; Amine Bennouna; Daniel Huttenlocher; Dingwen Kong; Liang Lyu; Asuman Ozdaglar
Abstract:	We develop a decision-theoretic model of human-AI interaction to study when AI assistance improves or impairs human decision-making. A human decision-maker observes private information and receives a recommendation from an AI system, but may combine these signals imperfectly. We show that the effect of AI assistance decomposes into two main forces: the marginal informational value of the AI beyond what the human already knows, and a behavioral distortion arising from how the human uses the AI's recommendation. Central to our analysis is a micro-founded measure of informational overlap between human and AI knowledge. We study an empirically relevant form of imperfect decision-making -- correlation neglect -- whereby humans treat AI recommendations as independent of their own information despite shared evidence. Under this model, we characterize how overlap and AI capabilities shape the Human-AI interaction regime between augmentation, impairment, complementarity, and automation, and draw key insights.
Date:	2026–02
URL:	https://d.repec.org/n?u=RePEc:arx:papers:2602.14331
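
The correlation-neglect mechanism described in this abstract can be illustrated with a small numeric sketch. This is not the paper's model: the two-state setup, the prior of 0.5, the likelihood ratio of 3, and the helper name `posterior` are all assumptions chosen for illustration. It takes the extreme case of full overlap, where the AI recommendation rests on exactly the evidence the human already holds, so a correct Bayesian counts that evidence once while a neglectful one counts it twice.

```python
def posterior(prior, likelihood_ratios):
    """Posterior probability after multiplying prior odds by each likelihood ratio."""
    odds = prior / (1 - prior)
    for lr in likelihood_ratios:
        odds *= lr
    return odds / (1 + odds)

# Hypothetical numbers: prior 0.5, one piece of shared evidence with likelihood ratio 3.
correct = posterior(0.5, [3.0])        # shared evidence counted once
neglect = posterior(0.5, [3.0, 3.0])   # same evidence counted twice (correlation neglect)

print(f"correct posterior:   {correct:.2f}")
print(f"neglectful posterior: {neglect:.2f}")
```

Under these assumed numbers the neglectful belief is more extreme than the correct one, which is the kind of overconfidence that double-counting shared evidence would be expected to produce.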

LemonadeBench: Evaluating the Economic Intuition of Large Language Models in Simple Markets

By:	Aidan Vyas
Abstract:	We introduce LemonadeBench v0.5, a minimal benchmark for evaluating economic intuition, long-term planning, and decision-making under uncertainty in large language models (LLMs) through a simulated lemonade stand business. Models must manage inventory with expiring goods, set prices, choose operating hours, and maximize profit over a 30-day period: tasks that any small business owner faces daily. All models demonstrate meaningful economic agency by achieving profitability, with performance scaling dramatically by sophistication, from basic models earning minimal profits to frontier models capturing 70% of theoretical optimal, a greater than 10x improvement. Yet our decomposition of business efficiency across six dimensions reveals a consistent pattern: models achieve local rather than global optimization, excelling in select areas while exhibiting surprising blind spots elsewhere.
Date:	2026–01
URL:	https://d.repec.org/n?u=RePEc:arx:papers:2602.13209

Generative AI and Career Choices

By:	Christian Gschwendt; Martina Viarengo; Thea S. Zoellner
Abstract:	The economic impact of technological change will critically depend on how future workers invest in their human capital. Yet, little is known about how future workers themselves evaluate and choose their educational and occupational paths in light of emerging technologies. This paper examines how adolescents currently at the school-to-work transition stage value working with generative artificial intelligence (GenAI) in their future occupations, and how automation risk and opportunities for continuing education shape these preferences. We field a discrete-choice experiment among a nationally representative sample of over 7,000 Swiss adolescents aged around 15. We find that adolescents generally exhibit an aversion to collaborating with GenAI at work, with females consistently more averse than males. However, preferences are nuanced: adolescents welcome greater GenAI collaboration, provided that GenAI usage levels remain moderate and that it is not accompanied by increases in job-automation risk. Finally, continuing education opportunities in occupations improve attitudes towards working with GenAI across genders. Our results challenge simple narratives of technology acceptance or rejection, revealing that adolescents' willingness to work with GenAI depends on how it is implemented (its intensity, associated displacement risks, and accompanying skill development) rather than the technology itself. Our findings suggest that the way future workers value GenAI collaboration in their career choices critically depends on its intensity and on the interplay with automation risk and AI-related educational opportunities.
Keywords:	occupational choice, gender gaps, GenAI, choice experiment, continuing education, automation risk
JEL:	I24 J24 O33
Date:	2026–02
URL:	https://d.repec.org/n?u=RePEc:iso:educat:0251

Janus-Faced Technological Progress and the Arms Race in the Education of Humans and Chatbots

By:	Wolfgang Kuhle
Abstract:	We study the conditions under which technological advances, in combination with a lognormal wage distribution, incentivize agents into an inefficient educational arms race. Our model emphasizes that lognormal wage distributions imply that agents' wages increase exponentially in the level of their skill as well as in the level of technology. In turn, this exponential relation between skills, technology, and wages pressures agents into an exhausting race for the tails of the economy's skill distribution. Moreover, technological advances and overinvestment in education increase GDP and inequality, while welfare may decline. In an alternative interpretation, our model studies firms that invest in artificial intelligence of their chatbots and AI agents. For a wide range of specifications, firms, just like humans, have an incentive to choose corner solutions where investment is limited only by borrowing constraints.
Date:	2026–02
URL:	https://d.repec.org/n?u=RePEc:arx:papers:2602.19783

AI Investment and Economic Growth: Exploring the Divergence of Equilibria

By:	Nobuyasu Suzusho (Graduate School of Economics, Kyoto University)
Abstract:	This paper focuses on the role of strategic complementarity between artificial intelligence (AI) investment and human capital accumulation, analyzing how they interact to shape long-term economic growth. Our model introduces an explicit threshold investment level necessary for viable AI adoption and demonstrates how its existence, along with agents' expectations, generates multiple equilibria. The paper concludes with policy recommendations to lower AI adoption barriers, bolster human capital, and align public and private expectations to foster a sustained, AI-driven growth trajectory.
Keywords:	Artificial Intelligence (AI), Economic Growth, Human Capital, Threshold Investment, Multiple Equilibria, Strategic Complementarity
JEL:	O33 O40 O41 J24
URL:	https://d.repec.org/n?u=RePEc:kyo:wpaper:1125

Generative AI, Productivity and the Future of Work

By:	Greg Cancelada
Abstract:	Generative AI is lifting productivity and transforming the future of work. Learn more with insights from an expert and real-world data on AI adoption and impact.
Date:	2025–10–08
URL:	https://d.repec.org/n?u=RePEc:fip:l00100:102779

The Global Trade Effects of the AI Infrastructure Boom

By:	Francois de Soyres; Alex Haag; Mike Liu; Eva Van Leemput
Abstract:	Artificial intelligence (AI) has become a key driver of the global economic outlook, underscored by the unprecedented scale of announced investment commitments aimed at expanding AI-related infrastructure. The AI boom is also increasingly influencing international trade by boosting demand for critical inputs and intermediate goods needed to build data centers.
Date:	2026–02–13
URL:	https://d.repec.org/n?u=RePEc:fip:fedgfn:102801

How Effectively Can Current LLMs Analyze Macrofinancial Issues?

By:	Paola Ganum; Tohid Atashbar
Abstract:	This paper empirically evaluates the ability of current Large Language Models (LLMs) to analyze macrofinancial coverage in IMF Article IV staff reports, using human economists' assessments as a benchmark. We test several GPT models on reports from 2016-2024, assessing their performance on both qualitative ratings and binary questions. Our findings indicate that the latest models can meaningfully assist economists, achieving an average accuracy of 71-75% on ratings and an average exact match rate of 76-81% on binary questions in 2024 across advanced GPT models. However, we find that LLMs tend to assign higher, less-dispersed ratings than human experts and struggle with open-ended questions that require deep contextual judgment. The paper provides quantitative evidence on current LLM accuracy in this domain, explores the drivers of its performance, and discusses key limitations such as optimistic bias.
Keywords:	AI; Large Language Model; Textual Analysis; Macrofinancial Surveillance; IMF Staff Reports; Human-AI Comparison
Date:	2026–02–27
URL:	https://d.repec.org/n?u=RePEc:imf:imfwpa:2026/035

Could Large Language Models work as Post-hoc Explainability Tools in Credit Risk Models?

By:	Wenxi Geng; Dingyuan Liu; Liya Li; Yiqing Wang
Abstract:	Post-hoc explainability is central to credit risk model governance, yet widely used tools such as coefficient-based attributions and SHapley Additive exPlanations (SHAP) often produce numerical outputs that are difficult to communicate to non-technical stakeholders. This paper investigates whether large language models (LLMs) can serve as post-hoc explainability tools for credit risk predictions through in-context learning, focusing on two roles: translators and autonomous explainers. Using a personal lending dataset from LendingClub, we evaluate three commercial LLMs, including GPT-4-turbo, Claude Sonnet 4, and Gemini-2.0-Flash. Results provide strong evidence for the translator role. In contrast, autonomous explanations show low alignment with model-based attributions. Few-shot prompting improves feature overlap for logistic regression but does not consistently benefit XGBoost, suggesting that LLMs have limited capacity to recover non-linear, interaction-driven reasoning from prompt cues alone. Our findings position LLMs as effective narrative interfaces grounded in auditable model attributions, rather than as substitutes for post-hoc explainers in credit risk model governance. Practitioners should leverage LLMs to bridge the communication gap between complex model outputs and regulatory or business stakeholders, while preserving the rigor and traceability required by credit risk governance frameworks.
Date:	2026–02
URL:	https://d.repec.org/n?u=RePEc:arx:papers:2602.18895

AlphaForgeBench: Benchmarking End-to-End Trading Strategy Design with Large Language Models

By:	Wentao Zhang; Mingxuan Zhao; Jincheng Gao; Jieshun You; Huaiyu Jia; Yilei Zhao; Bo An; Shuo Sun
Abstract:	The rapid advancement of Large Language Models (LLMs) has led to a surge of financial benchmarks, evolving from static knowledge tests to interactive trading simulations. However, current evaluations of real-time trading performance overlook a critical failure mode: severe behavioral instability in sequential decision-making under uncertainty. We empirically show that LLM-based trading agents exhibit extreme run-to-run variance, inconsistent action sequences even under deterministic decoding, and irrational action flipping across adjacent time steps. These issues stem from stateless autoregressive architectures lacking persistent action memory, as well as sensitivity to continuous-to-discrete action mappings in portfolio allocation. As a result, many existing financial trading benchmarks produce unreliable, non-reproducible, and uninformative evaluations. To address these limitations, we propose AlphaForgeBench, a principled framework that reframes LLMs as quantitative researchers rather than execution agents. Instead of emitting trading actions, LLMs generate executable alpha factors and factor-based strategies grounded in financial reasoning. This design decouples reasoning from execution, enabling fully deterministic and reproducible evaluation while aligning with real-world quantitative research workflows. Experiments across multiple state-of-the-art LLMs show that AlphaForgeBench eliminates execution-induced instability and provides a rigorous benchmark for assessing financial reasoning, strategy formulation, and alpha discovery.
Date:	2026–02
URL:	https://d.repec.org/n?u=RePEc:arx:papers:2602.18481

Measuring Online Media Ideology with Large Language Models and "Multi-Cue Classification"

By:	da Silva, Lucas Paulo (Trinity College Dublin)
Abstract:	Measuring media ideology is essential for researching media bias, media effects, and various important topics in political science, communication, and other social sciences. However, given journalistic norms of objectivity and the complexity of ideology, measuring media ideology accurately is uniquely challenging. Large language models (LLMs) have become valuable tools in this endeavor. Based on media communication theories, I argue that media ideology is expressed via different cues -- the topic, argument, framing, criticism, and sources of the media content -- and that LLMs often miss these. Standard methods of LLM classification also offer little control, flexibility, and data granularity to researchers. Drawing on insights about computational and quantitative measurement methodologies, I introduce the "Multi-Cue Classification" (MQ-Class) approach. With MQ-Class, an LLM classifies the different ideological cues separately and researchers then apply pre-specified weights and thresholds to combine them into one label per text. I compare standard LLM and MQ-Class methods using two example tasks -- classifying the economic and cultural ideologies of a novel sample of online media articles. Across multiple tests, MQ-Class is more accurate and puts researchers "back in the driver's seat." I conclude by discussing how MQ-Class could be implemented for other classification tasks and data.
Date:	2026–02–20
URL:	https://d.repec.org/n?u=RePEc:osf:socarx:zmtqp_v1
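
The combination step described in this abstract (separate per-cue classifications merged through pre-specified weights and thresholds) might look roughly like the sketch below. Only the five cue names come from the abstract. The scoring scale of -1 to 1, the weights, the thresholds, the label names, and the function `combine_cues` are hypothetical and are not taken from the paper.

```python
# Cue names follow the abstract; everything else here is an illustrative assumption.
CUES = ("topic", "argument", "framing", "criticism", "sources")

def combine_cues(cue_scores, weights, left_threshold=-0.2, right_threshold=0.2):
    """Merge per-cue scores in [-1, 1] into one label using a weighted mean.

    Negative scores lean left and positive scores lean right (assumed convention).
    """
    total_weight = sum(weights[c] for c in CUES)
    score = sum(weights[c] * cue_scores[c] for c in CUES) / total_weight
    if score <= left_threshold:
        return "left", score
    if score >= right_threshold:
        return "right", score
    return "neutral", score

# Hypothetical researcher-chosen weights and one article's per-cue scores.
weights = {"topic": 1.0, "argument": 2.0, "framing": 2.0, "criticism": 1.0, "sources": 1.0}
article = {"topic": 0.0, "argument": -1.0, "framing": -1.0, "criticism": 0.0, "sources": 1.0}

label, score = combine_cues(article, weights)
print(label, round(score, 3))
```

Keeping the weights and thresholds outside the model call is what would let a researcher rerun the aggregation under different assumptions without reclassifying any text.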

"Artificial Intelligence: Friend, Foe, Fraud"

By:	L. Randall Wray
Abstract:	The over-hyped Dot.com revolution bubbled and crashed at the end of the 1990s, leaving a largely unused physical and virtual infrastructure that eventually supported the rise of social media that did--indeed--transform life. Not necessarily in a good way. As Robert Gordon famously claimed, you can see the evidence of the digital revolution everywhere except in the data. Still, many billionaires were minted. After nearly a quarter century of growth, it seemed to have run its course until digital tech moved into the payments system promising another revolution based on cryptocurrencies. That, too, was over-hyped until Trump's reelection loosened rules to allow crypto to infect the financial system, targeting in particular the accumulated retirement savings of Americans. More billionaires minted. As P.T. Barnum (purportedly) proclaimed, "there's a sucker born every minute" and they add up but the number is still finite. The latest revolution is AI and it has generated the biggest bubble, by far. We are still in the early stages, but not only is AI almost single-handedly driving the stock market, it is also driving the "real" economy with its investments in data centers. One-hundred and three American billionaires were created since 2024, many of those owing to AI-related stock prices and investments. This paper will look in detail at the claims made for AI, the financial arrangements that are supporting its growth, and the dangers it poses for the US (and global) economies. While some argue that the current bubble looks little like the Dot.com bubble, that is true, but beside the point. The fragile financing of the AI bubble looks much more like the financial shenanigans that crashed into the Global Financial Crisis, and--unlike the Dot.com bubble that left us with a physical infrastructure that would eventually prove useful--the AI bubble will leave behind waste and destruction.
Keywords:	Artificial Intelligence; financial fragility, AI bubble; tech billionaires; financial fraud; technological revolution; Dot.com bubble; Global Financial Crisis; fraud; innovation; labor displacement by robots
JEL:	B52 E22 E32 O11 O16 O31 O38 O43 P17
Date:	2026–02
URL:	https://d.repec.org/n?u=RePEc:lev:wrkpap:wp_1107

This nep-ain issue is ©2026 by Ben Greiner. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.

General information on the NEP project can be found at https://nep.repec.org. For comments please write to the director of NEP, Marco Novarese at <director@nep.repec.org>. Put “NEP” in the subject, otherwise your mail may be rejected.

NEP’s infrastructure is sponsored by the Griffith Business School of Griffith University in Australia.