|
on Artificial Intelligence |
| By: | Ansgar Hudde (University of Cologne); Shannon Taflinger (University of Cologne) |
| Abstract: | Open-text questions in quantitative surveys can yield rich information from large samples, but analysing and coding these data using qualitative text analysis is resource-intensive. Large Language Models (LLMs) are a promising tool for scaling up such analyses, reducing time and financial costs. In this paper, we compare the coding accuracy of LLMs with that of student assistants, defining accuracy as agreement with a researcher-coded benchmark dataset. We assess performance on a semi-complex coding task: coding approximately 1,400 open-ended text responses from young US Americans about dating across party-political lines. A researcher-designed coding scheme, developed through thematic qualitative text analysis of the open-text responses, was applied by LLMs and student assistants. We evaluate models from OpenAI, Anthropic, and Mistral, with and without access to training data. The most advanced models outperform student assistants, and performance further increases with training data, highlighting LLMs’ capability to code open-text responses. Whereas previous research has mainly focused on social media texts, comparatively simple and surface-level coding tasks, and a technically oriented audience, we contribute to the literature by studying a particularly promising use case of open-ended survey responses and by providing practical recommendations to applied social scientists. |
| Keywords: | Large language models, open-ended questions, text analysis |
| JEL: | C81 C45 C83 |
| Date: | 2026–06 |
| URL: | https://d.repec.org/n?u=RePEc:ajk:ajkdps:416 |
| By: | Shang Wu; Randol Yao |
| Abstract: | This paper examines how estimates of AI use in scientific writing can be biased when evaluation methods ignore contextual differences across countries and fields. Using large-scale data on journal publications from Dimensions, we construct AI-likeness benchmarks based on differences between human-written and LLM-rephrased abstracts. We show that a pooled benchmark may confound pre-existing stylistic variation with AI-generated text, producing substantial distortions across country-field groups even in pre-LLM publications. In contrast, country-field-specific benchmarks attenuate such distortions and provide a more credible baseline for comparison. Applying these methods to publications in 2025 reveals that the pooled benchmark systematically overestimates AI use in certain countries and fields while underestimating it in others. These findings highlight the importance of context-aware measurement for accurate and equitable evaluation of AI use in science. |
| Date: | 2026–05 |
| URL: | https://d.repec.org/n?u=RePEc:arx:papers:2605.26662 |
| By: | Soria, Chris |
| Abstract: | What we learn from open-ended survey data depends on who—or what—does the coding. Large Language Models (LLMs) promise to democratize qualitative analysis, but do high agreement rates translate into equivalent thematic findings? This study compares eight LLMs to human annotators on a multilabel coding task using 3,200 responses from the UC Berkeley Social Networks Study, comprising over 19,000 coding decisions. Although LLM-human reliability does not match human-human reliability overall, LLMs approach human performance on simpler tasks and can serve as useful additional coders for generating consensus labels. Compared to a gold-standard human consensus, models achieve 82–97% per-category agreement, but macro F1 is lower and response-level similarity is lower still: even the best model reproduces the full human label set for fewer than 60% of responses. Yet high agreement masks thematic divergence. Models systematically over-identify themes, assigning 67% more categories per response, especially for categories requiring greater interpretive judgment. Models also show lower agreement for some demographic groups. These gaps are partly explained by response characteristics such as length, clarity, and atypicality, and some persist after controls, with implications for studies of populations whose response styles diverge from the corpus average. At the sample level, models largely preserve the overall thematic narrative: human and model category rankings correlate strongly (pooled Spearman's ρ=0.75), and top-performing models achieve approximately 80% directional agreement on demographic patterns. Concrete behavioral questions, such as reasons for moving or strategies for making friends, show especially strong alignment. Yet systematic over-classification can still shift narratives about how specific groups behave, leading researchers to report patterns that the human gold standard does not support. |
| Date: | 2026–06–03 |
| URL: | https://d.repec.org/n?u=RePEc:osf:socarx:85kyd_v1 |
| By: | Kononykhina, Olga; Haensch, Anna-Carolina; Kreuter, Frauke |
| Abstract: | Assigning free-text job descriptions to standardised taxonomies is a persistent bottleneck in survey research and official statistics. Large language models (LLMs) offer a promising path toward automation, but each step in the pipeline involves both model architecture and measurement choices about how an occupation should be represented. Through 119 experiments on German survey data, we systematically vary the textual representation of occupational categories, embedding models, LLMs, and prompt design. Category representation changes retrieval accuracy by 8–23 percentage points and classification by 11–21. Prompt role and abstention behaviour are model-specific and must be validated before deployment. The dominant source of variance, however, sits outside model measurement choices. How respondents describe their work matters more than any model or design choice (ICC = 0.76). |
| Date: | 2026–05–19 |
| URL: | https://d.repec.org/n?u=RePEc:osf:socarx:g6wjy_v1 |
| By: | Chen Zhu; Xiaolu Wang; Weilong Zhang |
| Abstract: | Large language models (LLMs) are increasingly used for tasks once reserved for trained researchers, including hypothesis generation, specification choice, and drafting conclusions. We argue that the reliability of AI-assisted research depends not only on model capability, but also on how cognitive labour is structured between humans and machines. We study this problem through Human-in-the-Loop Economic Research (HLER), a decision architecture based on pre-commitment, decision sequencing, accountability, and attention allocation. In a pre-specified 2*4 factorial experiment with 280 complete research runs across four datasets, an unconstrained multi-agent baseline produced critical failures in 72% of runs. Using the same underlying model, the same agent decomposition, and identical prompts for the shared reasoning agents, HLER reduced the failure rate to 16% by imposing three architectural commitments: LLMs reason but do not execute data work, data and estimation are handled deterministically, and three human decision gates bind the workflow. Fisher's exact test rejects equality of failure rates at p |
| Date: | 2026–06 |
| URL: | https://d.repec.org/n?u=RePEc:arx:papers:2606.12848 |
| By: | Daniel Martin |
| Abstract: | Artificial intelligence (AI) systems increasingly assist human experts, but the consequences of AI assistance on productivity can be heterogeneous. Caplin, Deming, S. Li, Martin, Marx, Weidmann, and Ye (2025b) provide evidence that two characteristics, ability and belief calibration, help to determine the returns to AI assistance. This note shows that their results replicate to a setting where professional radiologists analyze chest X-rays with access to state-of-the-art machine learning predictions. I leverage the public Collab-CXR data repository described by Moehring, Kutwal, Huang, Banerjee, Jacobi, Eber, Mendoza, Chung, Dayan, Gupta, Bui, Truong, Pareek, Langlotz, Lungren, Agarwal, Rajpurkar, and Salz (2025) and first analyzed for human-AI collaboration by Agarwal, Moehring, Rajpurkar, and Salz (2023). To faithfully reproduce the analysis in Caplin, Deming, S. Li, Martin, Marx, Weidmann, and Ye (2025b), I use the radiologist assessments from the repeated-case designs, which include 68 radiologists and 11,420 paired radiologist-patient-pathology observations. The results of this replication support the external validity of their core findings: lower baseline ability and higher calibration predict larger incremental value from AI. |
| Date: | 2026–06 |
| URL: | https://d.repec.org/n?u=RePEc:arx:papers:2606.12585 |
| By: | Mert Demirer; Leon Musolff; Liyuan Yang |
| Abstract: | How do the productivity effects of AI evolve across successive generations of tools, and to what extent do task-level gains ultimately translate into final output? We study these questions in the context of software development, using data on more than 100,000 GitHub developers combined with their AI usage telemetry. In a matched event study design, we find that autocomplete, interactive coding agents, and autonomous coding agents each significantly increase coding activity (“commits”), with respective cumulative effects of 40%, 140%, and 180%. These gains, however, attenuate sharply across the production hierarchy: the 180% cumulative effect falls to 50% for the number of projects, and to 30% for actual releases. This pattern is consistent with the weak-link hypothesis: the strong productivity gains from AI are attenuated by human bottlenecks in the production chain, with an estimated elasticity of substitution of 0.25 between AI and human effort, which indicates strong complementarities. We further confirm these results across four major app marketplaces, finding a moderate increase in the number of new apps but no increase in total usage. Large task-level AI productivity gains have therefore translated only partially into shipped and used software thus far. |
| JEL: | D24 L86 O33 |
| Date: | 2026–05 |
| URL: | https://d.repec.org/n?u=RePEc:nbr:nberwo:35275 |
| By: | Ilse Lindenlaub; Ryungha Oh; Maria Alejandra Rodriguez; Laura Veldkamp |
| Abstract: | We document and explain the gap between measures of AI exposure and measures of AI adoption in the workplace. This leads us to propose a new AI adoption index based on comparative advantage. Using the representative German DiWaBe employee survey linked to worker and establishment information, we compare worker-reported AI use to prominent exposure measures and find that the relationship is weak. Motivated by this gap, we develop a framework in which adoption depends not only on technical feasibility (i.e., AI’s absolute advantage measured by exposure) but on profitability (i.e., AI’s comparative (dis)advantage relative to a specific worker), balancing AI productivity against AI user costs and worker productivity against wages. We operationalize this framework at the task level by (i) estimating worker productivity relative to pay, (ii) mapping exposure indices into AI productivity, and (iii) inferring task-specific AI user costs from revealed-preference adoption. The resulting occupation-level index accounts for almost 60% of cross-occupation variation in observed AI adoption, compared to 14% for an exposure-only model. The two approaches diverge substantially for approximately 30% of workers, highlighting that comparative advantage—not exposure alone—is crucial for assessing AI’s labor-market impact. |
| JEL: | E0 J2 J3 |
| Date: | 2026–05 |
| URL: | https://d.repec.org/n?u=RePEc:nbr:nberwo:35271 |
| By: | Risi, Gianluca |
| Abstract: | This paper investigates the impact of digitalization and AI on wage inequality both between and within task-based groups of workers across Italian provinces. We contribute along two aspects: building a novel vertical categorization of occupational task dimensions designed to capture the distinct labor market implications of traditional digitalization and AI, and constructing the AI Occupational Catch-Up Index (AI-OCUI), the first empirical operationalization of the OECD AI Capability Indicators framework. Using a panel regression model for Italian NUTS3 regions over 2015-2018, we find that neither technology affects between-group inequality, while traditional digitalization reduces wage dispersion within the cognitive group and AI exposure compresses inequality within the non-routine group - a differentiation consistent with the distinct task profiles targeted by each technological wave. Both effects are attenuated in cities, where agglomeration dynamics moderate the equalizing potential of digital technologies. These findings contribute to the growing literature on the wage implications of digital transformation, with relevant implications for policy. |
| Keywords: | Digitalization, Artificial Intelligence, AI Capability Indicators, Task-based framework, Wage inequality, Italian provinces |
| JEL: | J24 O33 R11 R23 |
| Date: | 2026 |
| URL: | https://d.repec.org/n?u=RePEc:zbw:esprep:340911 |
| By: | Ozili, Peterson K |
| Abstract: | This study examines the asymmetric effect of artificial intelligence (AI) innovation on economic growth in 50 countries from 2000 to 2020 using the quantile regression method. The findings reveal that AI innovation stimulates economic growth at the low and middle tails of the economic growth distribution. Interaction analyses reveal that the use of AI innovation in the stock market stimulates economic growth while the use of AI innovation to support financial stability and international trade activities diminishes economic growth. Asymmetric interaction analyses reveal that: AI innovation stimulates economic growth when countries are experiencing low growth rates; the use of AI innovation in the stock market stimulates economic growth when countries are experiencing high growth rates and in mid-growth emerging market and developing countries; the use of AI innovation to support financial stability activities diminishes economic growth when countries are experiencing low growth rates; and the use of AI innovation to support international trade activities diminishes economic growth when countries are experiencing high growth rates. |
| Keywords: | Quantile regression, asymmetry, economic growth, artificial intelligence, innovation, internet, financial stability, unemployment, trade openness, endogenous growth theory |
| JEL: | O30 O31 O33 O47 |
| Date: | 2026 |
| URL: | https://d.repec.org/n?u=RePEc:pra:mprapa:128950 |
| By: | Ursel Baumann; Zoë B. Cullen; Ester Faia; Annalisa Ferrando; Ricardo Perez-Truglia; Judit Rariga |
| Abstract: | How well does innovation diffuse across geographic boundaries? To shed light on this question, we present a large-scale field experiment involving 3,300 firms across twelve European Union countries. We elicit firms' perceptions of the share of similar firms in their own country that had invested in artificial intelligence (AI), as well as the corresponding share among similar firms in Germany, France, and Italy. We randomly provide half of the sample with accurate information about both domestic and foreign AI investment. We show that firms substantially underestimate competitors' current AI investment, both domestically and abroad, and that they update their expectations about competitors' future AI investment in response to the information treatment. The treatment also causes a statistically significant increase in firms' own expected AI investment rate (p-value |
| JEL: | C93 D22 L21 O33 |
| Date: | 2026–06 |
| URL: | https://d.repec.org/n?u=RePEc:nbr:nberwo:35314 |
| By: | Leonhard Reiter (Universität Wien = University of Vienna); Moritz Joerling (EM - EMLyon Business School); Christoph Fuchs (Universität Wien = University of Vienna); Working Group Artificial Intelligence In Higher Education; Robert Böhm (Universität Wien = University of Vienna) |
| Abstract: | Although Artificial Intelligence (AI) holds immense potential to enhance the educational experience, its use also presents challenges. This research examines the use and misuse of AI tools for university-related tasks. We surveyed 498 students from three faculties at a large European university to, first, identify factors driving their willingness to use AI tools for university-related tasks, and, second, estimate the prevalence of cheating behavior involving the unauthorized use of AI tools in examinations. Specifically, we tested and extended the Technology Acceptance Model 2 (TAM2) by identifying trust and perceived opportunity costs as additional determinants of using AI tools for university-related tasks. To estimate the proportion of students cheating during examinations, we applied a randomized response technique. We discuss the results with respect to the effective and appropriate implementation of AI tools in higher education. Our findings can help educators and policymakers to promote responsible AI use while mitigating its misuse. |
| Keywords: | AI use, Artificial intelligence, AI, education, technology acceptance, AI misuse, TAM, cheating |
| Date: | 2025–10–02 |
| URL: | https://d.repec.org/n?u=RePEc:hal:journl:hal-05626257 |
| By: | Taojie Zhu; Wentao Zhao; Rui Sun; Beidi Luan; Jiacheng Lu; Sinuo Wang; Jing Li; Daxin Jiang; Yonghong He; Zuo Bai |
| Abstract: | Evaluating whether large language model (LLM) agents can profit in capital markets is increasingly framed as end-to-end trading: place an agent in a historical market, let it trade, and measure portfolio returns. This setup is vulnerable to two evaluation failures. First, long backtests often overlap with the knowledge cutoffs of frontier LLMs, allowing memorized tickers, dates, prices, and market narratives to substitute for investment reasoning. Second, raw returns are a noisy proxy for stock-selection ability, since positive performance may come from market beta, style exposure, or favorable regimes rather than genuine alpha. We introduce KTD-Fin (Knowing-To-Doing Financial Benchmark), an end-to-end stock-market trading benchmark that addresses both issues. KTD-Fin uses a data-side masking protocol to anonymize key identifiers and calendar information consistently across prompts and tools, separating historical market memory from investment decision-making. It also incorporates a Barra-style performance attribution framework that decomposes portfolio returns into market, style, and stock-selection alpha components. Across ten frontier LLM agents evaluated on the Chinese CSI300 over a 2024–2026 window, masking substantially changes agent rationales, pushing them towards anonymized factor-based reasoning. Attribution analysis further shows that LLM agents' cumulative returns under leakage-controlled evaluation are largely explained by passive market and style exposure, with limited evidence of persistent stock-selection alpha. These findings suggest that financial LLM benchmarks should evaluate not only whether an agent makes money, but also whether the source of returns reflects transferable investment skill. We release KTD-Fin as a reproducible template for leakage-controlled and attribution-aware evaluation of LLM trading agents. |
| Date: | 2026–05 |
| URL: | https://d.repec.org/n?u=RePEc:arx:papers:2605.28359 |
| By: | Gmyrek, Pawel; Viollaz, Mariana; Winkler, Hernan |
| Abstract: | This article examines how generative artificial intelligence (GenAI) could affect labor markets globally, with particular attention to the uneven distribution of risks and opportunities between advanced and developing economies. Cross-country differences in occupational structure suggest that developing economies face lower aggregate automation exposure than advanced economies but comparable potential for task augmentation. However, disparities in digital infrastructure create an asymmetry: workers in positions vulnerable to automation typically maintain sufficient internet connectivity to experience displacement effects even in low-income settings, while those who could benefit from GenAI augmentation face substantial digital infrastructure gaps that may prevent them from realizing productivity gains. This finding suggests that developing countries may experience the disruptive effects of GenAI faster than its productivity benefits. At the same time, conventional occupational exposure measures systematically overestimate the impact of GenAI in developing countries by assuming uniform task content across economies. Using data from skills surveys, the article demonstrates that workers in developing countries perform substantially fewer non-routine analytical tasks—the primary targets of GenAI—even within occupations classified as highly exposed. These findings highlight the importance of adapting GenAI exposure measures to reflect developing countries' distance from the technology frontier. |
| Date: | 2026–03–16 |
| URL: | https://d.repec.org/n?u=RePEc:wbk:wbrwps:11328 |
| By: | Wenbin Wu |
| Abstract: | Large language models now power robo-advisors and trading agents, yet whether they carry built-in biases toward specific assets is largely untested. We ask three questions: do LLMs systematically prefer certain financial instruments; can an internal representation with causal leverage over those preferences be identified; and does that representation affect downstream financial decisions? We develop a three-level audit protocol and apply it to Bitcoin. First, a behavioral audit of eight frontier LLMs shows that Bitcoin's ranking among money-like instruments is frame-dependent: models place it around rank 5 of 8 as "reliable money" but near the top under crisis and autonomous-agent frames, and an attribute-swap experiment confirms rankings track functional properties, not names. Second, we open a model's internals: a search across thousands of sparse-autoencoder features in Gemma 3 identifies a dominant Bitcoin-selective feature. Amplifying it shifts the model toward the asset and suppressing it shifts the model away, even when "Bitcoin" never appears in the prompt. Third, we test financial consequences: amplification raises Bitcoin's portfolio share by 5.2 percentage points while suppression lowers it by 4.6 pp, with amplification reallocating within crypto and suppression cutting total crypto exposure. We characterize this as bounded behavioral leverage (leverage meaning causal influence over outputs, not financial leverage): an identifiable internal feature can be perturbed to move financial choices, but only within measurable limits. The framework links internal representations to external recommendations, validated with random controls and mechanism boundaries. As LLMs become autonomous financial agents, this is a first step toward a behavioral layer for emerging know-your-agent (KYA) standards: knowing what an agent prefers, and how far that preference can be moved. |
| Date: | 2026–06 |
| URL: | https://d.repec.org/n?u=RePEc:arx:papers:2606.02528 |