nep-big New Economics Papers
on Big Data
Issue of 2026–05–25
eleven papers chosen by
Tom Coupé, University of Canterbury


  1. Analyzing Carbon Removal Technology Hype Cycles Through Large Language Models By Medha Nag Kommaghatta Girish; Reinhard Madlener
  2. Using DSGE and Machine Learning to Forecast Public Debt for France By Emmanouil Sofianos; Thierry Betti; Theophilos Papadimitriou; Amélie Barbier-Gauchard; Periklis Gogas
  3. Nowcasting Italian Municipal Income with Nightlights: A Deep Learning Approach By Massimo Giannini
  4. Monetary Policy in the Media Spotlight: Sentiments, Signals, and Economic Impact By Firmin Ayivodji; Etienne Briand; Kevin Moran; Dalibor Stevanovic
  5. GenAI-Based Index of Financial Constraints By Bektemir Ysmailov
  6. A Market-Rule-Informed Neural Network for Efficient Imbalance Electricity Price Forecasting By Runyao Yu; Julia Lin; Derek W. Bunn; Jochen Stiasny; Wentao Wang; Yujie Chen; Tara Esterl; Peter Palensky; Jochen L. Cremer
  7. Automating Evidence Synthesis: A Comparative Evaluation of Large Language Models for Data Extraction By Aditya Retnanto; Yohan Iddawela; Elaine Tan
  8. Pegs, Floats, and Forests: A Machine Learning Revisit of Exchange Rate Regimes and Growth in Transition Economies By Marjan Petreski
  9. Revealing Life Preferences Through LLMs By Omar Abdel Haq; Amitabh Chandra; Tomáš Jagelka; Erzo F.P. Luttmer; Joshua Schwartzstein
  10. Beyond Sentiment Classification: A Generative Framework for Emotion Intensity Evaluation in Text By Francesco A. Fabozzi; Dasol Kim; William N. Goetzmann
  11. When development finance spurs entrepreneurship: New evidence from 5 million projects using a machine learning classifier By Werner, Sven; Trotter, Philipp

  1. By: Medha Nag Kommaghatta Girish (RWTH Aachen University); Reinhard Madlener (1- Institute for Future Energy Consumer Needs and Behavior (FCN), School of Business and Economics / E.ON Energy Research Center, RWTH Aachen University, Mathieustrasse 10, 52074 Aachen, Germany; 2- Department of Industrial Economics and Technology Management, Norwegian University of Science and Technology (NTNU), Sentralbygg 1, Gløshaugen, 7491 Trondheim, Norway. November 2023)
    Abstract: This study examines whether Large Language Models (LLMs) can support the development of sentiment-based indicators for technological hype cycles, characterized by optimistic media language during hype phases and negative sentiment during periods of disillusionment. The rapid growth of digital news makes manual tracking of sentiment trends impractical. Traditional computational approaches relying on basic natural language processing often fail to capture context in long-form texts. LLMs offer a scalable alternative by enabling context-aware sentiment analysis across large collections of complex news articles. The study introduces an LLM-driven methodology to analyze temporal sentiment patterns in English-language news coverage over a 16-year period, focusing on carbon removal technologies (CRTs). The approach extracts sentiments from major news sources and maps them over time to identify media-driven hype dynamics. This framework enables systematic analysis of technologies such as Bioenergy with Carbon Capture and Storage, afforestation/reforestation, Direct Air Capture, and ocean-based carbon capture. The key finding is that media sentiment tracking can complement other innovation indicators in the mapping of a Hype Cycle, revealing that the CRT domain as a whole is heading towards a plateau of productivity.
    Keywords: Carbon Removal Technologies; Hype Cycles; Large Language Models; Sentiment Analysis; Media Discourse
    Date: 2026–01
    URL: https://d.repec.org/n?u=RePEc:ris:fcnwpa:022474
  2. By: Emmanouil Sofianos (BETA - Bureau d'Économie Théorique et Appliquée - AgroParisTech - UNISTRA - Université de Strasbourg - Université de Haute-Alsace (UHA) - Université de Haute-Alsace (UHA) Mulhouse - Colmar - UL - Université de Lorraine - CNRS - Centre National de la Recherche Scientifique - INRAE - Institut National de Recherche pour l’Agriculture, l’Alimentation et l’Environnement); Thierry Betti (BETA - Bureau d'Économie Théorique et Appliquée - AgroParisTech - UNISTRA - Université de Strasbourg - Université de Haute-Alsace (UHA) - Université de Haute-Alsace (UHA) Mulhouse - Colmar - UL - Université de Lorraine - CNRS - Centre National de la Recherche Scientifique - INRAE - Institut National de Recherche pour l’Agriculture, l’Alimentation et l’Environnement); Theophilos Papadimitriou (DUTH - Democritus University of Thrace); Amélie Barbier-Gauchard (BETA - Bureau d'Économie Théorique et Appliquée - AgroParisTech - UNISTRA - Université de Strasbourg - Université de Haute-Alsace (UHA) - Université de Haute-Alsace (UHA) Mulhouse - Colmar - UL - Université de Lorraine - CNRS - Centre National de la Recherche Scientifique - INRAE - Institut National de Recherche pour l’Agriculture, l’Alimentation et l’Environnement); Periklis Gogas (DUTH - Democritus University of Thrace)
    Abstract: Forecasting public debt is essential for effective policymaking and economic stability, yet traditional approaches face challenges due to data scarcity. While machine learning (ML) has demonstrated success in financial forecasting, its application to macroeconomic forecasting remains underexplored, hindered by short historical time series and low-frequency (e.g., quarterly/annual) data availability. This study proposes a novel hybrid framework integrating dynamic stochastic general equilibrium (DSGE) modeling with ML techniques to address these limitations, focusing on the evolution of France's public debt. We first generate a large artificial macroeconomic dataset using an estimated DSGE model for France, which allows for efficient training of ML algorithms. These trained models are then applied to actual historical data for directional debt forecasting. The results show that the best machine learning model is an XGBoost achieving 90% accuracy, outperforming an elastic net model, used as benchmark. Our results highlight the viability of combining structural economic models with data-driven techniques to improve macroeconomic forecasting.
    Keywords: public debt, machine learning, France, forecasting, DSGE
    Date: 2026–03–05
    URL: https://d.repec.org/n?u=RePEc:hal:journl:hal-05620169
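The abstract of entry 2 reports accuracy for directional debt forecasting. As a minimal sketch of how such a metric can be computed, the snippet below scores predicted up/down moves against realized ones. The function name, the input layout, and the debt-ratio values are hypothetical illustrations, not taken from the paper.

```python
def directional_accuracy(actual, predicted_up):
    """Share of periods in which the predicted direction matches the realized one.

    actual: series of levels; predicted_up[t] is True when the model
    predicts actual[t + 1] > actual[t].
    """
    realized_up = [nxt > cur for cur, nxt in zip(actual, actual[1:])]
    hits = sum(p == r for p, r in zip(predicted_up, realized_up))
    return hits / len(realized_up)


# Hypothetical debt-to-GDP levels and model calls (illustration only)
debt_ratio = [97.0, 98.1, 97.8, 111.0, 112.9, 111.9]
predicted_up = [True, False, True, True, True]

acc = directional_accuracy(debt_ratio, predicted_up)
print(f"Directional accuracy: {acc:.0%}")
```

In this toy case the model misses only the final downturn, so four of five directions are correct.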
  3. By: Massimo Giannini
    Abstract: This paper assesses whether NASA Black Marble nightlight intensity can serve as an early indicator of annual taxable income at the Italian municipal level, where official data are released with a 12-18 month lag. Using a panel of 7,631 municipalities over 2012-2021, we compare four recurrent neural network architectures (LSTM, BiLSTM, GRU, Transformer) against six benchmarks: simple persistence, panel fixed effects, autoregressive distributed lag, and two spatial econometric specifications (SAR, Spatial Durbin) on a queen-contiguity matrix. Models are trained on 2012-2019 and evaluated out-of-sample on 2020-2021 with a cross-sectional Diebold-Mariano test. A single-layer GRU achieves a median forecast error of 1.07 million euros across the cross-section of municipalities - approximately 4% of the median municipal IRPEF income of 29 million euros - statistically dominating every benchmark (DM > 4 against persistence, > 40 against spatial linear models, all $p
    Date: 2026–05
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2605.08782
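The evaluation in entry 3 rests on a cross-sectional Diebold-Mariano test. The sketch below shows one simple form of that idea: the mean loss differential between two forecasts across units, divided by its standard error. The absolute-error loss, the independence assumption across units, the function name, and the toy numbers are all assumptions made for illustration and may differ from the paper's implementation.

```python
import math
from statistics import mean, stdev


def cross_sectional_dm(actual, forecast_a, forecast_b):
    """DM-style t-statistic on absolute-error loss differentials across units.

    Positive values mean forecast_b has the smaller average loss.
    Assumes differentials are independent across units (no spatial correction).
    """
    d = [
        abs(y - fa) - abs(y - fb)
        for y, fa, fb in zip(actual, forecast_a, forecast_b)
    ]
    return mean(d) / (stdev(d) / math.sqrt(len(d)))


# Hypothetical municipal incomes in million euros (not from the paper)
actual = [29.0, 12.0, 55.0, 8.0, 31.0, 20.0]
persistence = [27.0, 13.5, 51.0, 9.0, 28.0, 22.5]
model = [28.5, 12.2, 54.0, 8.3, 30.0, 20.4]

dm = cross_sectional_dm(actual, persistence, model)
print(f"DM statistic (model vs. persistence): {dm:.2f}")

# Arithmetic check of the ratio quoted in the abstract:
# 1.07 / 29 is about 3.7%, which the abstract rounds to roughly 4%.
ratio = 1.07 / 29
print(f"{ratio:.1%}")
```

A positive statistic here favors the second forecast; with real municipal data, spatial dependence between neighbors would likely call for a corrected variance estimate.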
  4. By: Firmin Ayivodji (International Monetary Fund); Etienne Briand (University of Quebec in Montreal); Kevin Moran (Laval University); Dalibor Stevanovic (University of Quebec in Montreal)
    Abstract: News media coverage of monetary policy is not a passive transcript of central-bank communication: it filters announcements, macroeconomic news, and editorial choices into narratives that move expectations and policy decisions. We embed media sentiment into a behavioral New-Keynesian model in which the central bank reacts to sentiment and sentiment follows an explicit law of motion. We construct monetary-policy sentiment indicators from more than 50,000 Canadian newspaper articles using dictionary methods, transformer models, and a generative-AI framework. Media sentiment shifts household inflation and wage expectations, improves out-of-sample forecasts of GDP growth and inflation, and loads positively on the Bank of Canada's estimated Taylor rule once treated as endogenous. A Bayesian SVAR identifies anticipated and unanticipated monetary-policy shocks together with a narrative shock; the narrative shock contributes a non-trivial share of medium-horizon macroeconomic variance, and a counterfactual that shuts down the dynamic feedback from media sentiment attenuates the propagation of monetary policy to output and prices.
    Keywords: Monetary policy, text analysis, news media, machine learning, forecasting
    JEL: E52 E58 E71 D84 C32 C55
    Date: 2026–05
    URL: https://d.repec.org/n?u=RePEc:bbh:wpaper:26-03
  5. By: Bektemir Ysmailov (Nazarbayev University, Graduate School of Business)
    Abstract: I construct a new measure of financial constraints by applying a large language model to narrative disclosures in firms' Management's Discussion and Analysis from Form 10-K filings. The model evaluates each filing as a finance expert and classifies the firm's external financing difficulty on an ordered scale, producing the GenAI FC Index. The index captures contextual signals - such as nuanced liquidity discussions - that traditional accounting-based and prior text-based proxies often miss. It behaves sensibly in both the time series and cross-section and shows only moderate correlations with existing measures, indicating that it contains distinct information. Behavioral tests reveal that firms classified as constrained recycle far less equity and are substantially more likely to omit dividends, and less likely to initiate or increase them. Across these settings, the GenAI FC Index yields stronger and more consistent behavioral separation than benchmark text-based measures. The results demonstrate that generative AI can extract economically meaningful information about firms' financing frictions at scale.
    Keywords: financial constraints, generative AI (GenAI), large language models (LLMs), textual analysis, MD&A disclosures, corporate finance
    JEL: G30 G32 M41 C81
    Date: 2026–01
    URL: https://d.repec.org/n?u=RePEc:asx:nugsbw:2026-01
  6. By: Runyao Yu; Julia Lin; Derek W. Bunn; Jochen Stiasny; Wentao Wang; Yujie Chen; Tara Esterl; Peter Palensky; Jochen L. Cremer
    Abstract: Accurate and efficient imbalance electricity price forecasting is critical for industrial energy trading systems, especially as battery assets and automated bidding pipelines increasingly participate in balancing markets. However, real-time forecasting is complicated by nonlinear market-rule-based price formation, heterogeneous input signals, and incomplete data availability caused by communication delays, publication lags, and measurement outages. This paper proposes a market-rule-informed neural forecasting framework that embeds imbalance price formation rules into the latent space of an expressive neural network. The proposed framework preserves raw signal information while exploiting transparent market-rule priors. We further analyze operational robustness by removing price-component information and characterize how forecasting performance scales with input length and forecasting horizon. Experimental results show that the proposed model achieves competitive forecasting performance with substantially fewer trainable parameters and shorter training time than generic deep learning baselines, demonstrating that market-rule priors and expressive neural networks should be jointly used for accurate and computationally sustainable forecasting in industrial energy trading applications. The implementation is publicly available at https://runyao-yu.github.io/MRINN/.
    Date: 2026–05
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2605.09061
  7. By: Aditya Retnanto (Asian Development Bank); Yohan Iddawela (Asian Development Bank); Elaine Tan (Asian Development Bank)
    Abstract: Systematic reviews and meta-analyses (SRMAs) are important tools for evidence synthesis but have historically required substantial manual effort, particularly during the data extraction phase. To address this bottleneck, we developed and evaluated an automated pipeline that utilizes large language models (LLMs) to ingest full text scientific articles and extract structured metadata. We benchmarked the performance of leading models, including Gemini 2.5 Pro, GPT-5, and Sonnet 4.0, across two distinct domains: mobile health interventions and education. Our results indicate that Gemini 2.5 Pro achieved the strongest performance in qualitative metadata extraction and outcome identification. However, quantitative metadata extraction remained a significant challenge. Models struggled to interpret complex data across multiple tables and failed to calculate effect sizes when only raw figures were reported. Crucially, we find that human annotators often applied implicit filtering criteria not documented in the coding manual, which made benchmarking the results challenging. We discuss the implications of these findings, emphasizing that while LLMs can accelerate the coding process, reliable automation requires significantly more prescriptive coding manuals to strictly steer model behavior and ensure fair benchmarking.
    Keywords: evidence synthesis automation;large language models (LLMs);data extraction benchmarking;systematic reviews and meta-analyses (SRMA)
    JEL: C88
    Date: 2026–05–15
    URL: https://d.repec.org/n?u=RePEc:ris:adbewp:022484
  8. By: Marjan Petreski
    Abstract: This paper combines traditional panel econometrics with random forest machine learning to revisit the relationship between exchange rate regimes and economic growth for 27 transition economies over 1991-2019. Exploiting the Couharde-Grekou (2024) probabilistic synthesis classification, the random forest approach non-parametrically confirms and sharpens what fixed-effects and system GMM estimation establish parametrically: intermediate exchange rate regimes consistently underperform fixed arrangements, with growth penalties ranging from -1.0 to -10.4 percentage points, while floating regimes show negative but largely insignificant differentials. Beyond regime effects, the machine learning analysis reveals that the intermediate regime penalty is sharpest precisely where institutions are weakest - non-parametric validation that institutional capacity, not regime label alone, determines whether exchange rate anchoring pays off. The regime-growth relationship is further concentrated in the pre-2003 stabilization era and is absent among EU member economies, suggesting the growth dividend from exchange rate anchoring eroded as institutional convergence advanced. Together, these findings demonstrate how machine learning variable importance metrics can corroborate and enrich causal inference from panel methods, while supporting the view that exchange rate anchoring carried a meaningful credibility dividend during the formative phase of transition.
    Date: 2026–05
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2605.17391
  9. By: Omar Abdel Haq; Amitabh Chandra; Tomáš Jagelka; Erzo F.P. Luttmer; Joshua Schwartzstein
    Abstract: Large Language Models (LLMs) are trained on a prodigious corpus of human writing and may reveal human preferences over characteristics of life courses, such as income, longevity, and working conditions. We present OpenAI's GPT-5.4 and a broadly representative sample of Americans with pairs of life stories and ask them to choose the life they would prefer for themselves. A person's choice is better predicted by the LLM's choice than by another person’s choice over the same stories, and LLM valuations of several life attributes are similar to those derived from human responses. Our results suggest that LLM responses offer a scalable and cost-effective complement to existing methods for studying human preferences.
    JEL: D0 H0 I0
    Date: 2026–05
    URL: https://d.repec.org/n?u=RePEc:nbr:nberwo:35185
  10. By: Francesco A. Fabozzi; Dasol Kim; William N. Goetzmann
    Abstract: We introduce a novel approach to emotion modeling that shifts the focus from identification to evaluation, addressing the limitations of discrete classification in applied domains such as finance. By constructing a dataset of emotional intensity scores and fine-tuning open-weight generative language models to output continuous values from 0-100, we demonstrate a more expressive, generalizable framework for sentiment and emotion analysis. Our findings not only outperform classification baselines but also reveal surprising generalization capabilities and transfer effects to related constructs such as sentiment and arousal. This work contributes to the interdisciplinary recontextualization of NLP by introducing emotion intensity evaluation as an alternative to classification, arguing that this shift better aligns with the needs of domains--such as finance--where the degree of emotional content is central to interpretation and decision-making.
    Date: 2026–05
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2605.16613
  11. By: Werner, Sven; Trotter, Philipp
    Abstract: Development finance increasingly funds entrepreneurship in developing countries, but evidence of its impact on entrepreneurship is mixed. Existing studies analyze total development finance flows as entrepreneurship-specific development finance data did not previously exist. By training and validating a machine-learning classifier on development finance project descriptions (2000-2022; 5 million projects; 97% accuracy), we introduce a scalable, replicable measure of specific entrepreneurship-support development finance (ESDF). Crucially, this measure allows us to assess which entrepreneurship margins respond to development finance. In a 19-year panel of 50 developing countries, two-way fixed-effects regressions show that higher ESDF is associated with higher entrepreneurial intentions, while total development finance is not. ESDF is not significantly linked to early-stage entrepreneurial activity, however, suggesting conversion bottlenecks in current entrepreneurial processes.
    Abstract: Development finance is increasingly directed at promoting entrepreneurship in the Global South. The macroeconomic evidence on the effectiveness of this support, however, has so far been mixed. Previous studies rely on aggregate development finance data because specific data on entrepreneurship support were not previously available. In this paper, we develop a scalable and replicable global measure of entrepreneurship-support development finance (ESDF). To do so, we train and validate a machine-learning classification model on the descriptions of 5 million development aid projects from 2000 to 2022 (model accuracy: 97%). This measure makes it possible to examine the effect of ESDF on different stages of the start-up process. Based on a panel of 50 countries over 19 years, regressions with country and year effects show that a higher ESDF volume is associated with stronger entrepreneurial intentions, whereas no corresponding relationship appears for aggregate development finance. At the same time, there is no significant relationship between ESDF and start-up activity. This suggests that additional support raises the propensity to start a business but does not automatically translate into actual start-ups.
    Keywords: Entrepreneurship-support development finance, international assistance, entrepreneurial intentions, early-stage entrepreneurship, machine learning classification
    JEL: F35 O19 L26 C23 C45
    Date: 2026
    URL: https://d.repec.org/n?u=RePEc:zbw:rwirep:341094
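Entry 11 relies on two-way fixed-effects regressions. As a minimal sketch of the underlying mechanics, the snippet below estimates a single slope on a balanced panel by removing unit and period means from both variables (the within transformation) and then running a simple regression on the residuals. The function name, the tuple layout, and the toy data are hypothetical, and the paper's specification presumably includes controls and inference that are omitted here.

```python
from collections import defaultdict


def twfe_slope(rows):
    """Slope from a two-way fixed-effects regression on a balanced panel.

    rows: list of (unit, period, x, y) tuples. Applies the within
    transformation z_it - mean_i - mean_t + grand_mean to x and y,
    which is valid for balanced panels only.
    """
    def demean(idx):
        by_unit, by_time = defaultdict(list), defaultdict(list)
        for r in rows:
            by_unit[r[0]].append(r[idx])
            by_time[r[1]].append(r[idx])
        grand = sum(r[idx] for r in rows) / len(rows)
        unit_mean = {k: sum(v) / len(v) for k, v in by_unit.items()}
        time_mean = {k: sum(v) / len(v) for k, v in by_time.items()}
        return [r[idx] - unit_mean[r[0]] - time_mean[r[1]] + grand for r in rows]

    x, y = demean(2), demean(3)
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


# Toy balanced panel: y = 2 * x + unit effect + period effect (no noise)
unit_fx = {"A": 10.0, "B": -5.0, "C": 3.0}
time_fx = {1: 0.0, 2: 1.0, 3: 7.0}
x_vals = {"A": [1.0, 2.0, 4.0], "B": [0.0, 3.0, 1.0], "C": [2.0, 2.0, 5.0]}

panel = [
    (u, t, x_vals[u][t - 1], 2.0 * x_vals[u][t - 1] + unit_fx[u] + time_fx[t])
    for u in unit_fx
    for t in time_fx
]

slope = twfe_slope(panel)
print(f"Estimated slope: {slope:.3f}")
```

Because the toy outcome is built without noise, the estimator recovers the slope of 2 regardless of the unit and period effects, which is the point of the double demeaning.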

This nep-big issue is ©2026 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at https://nep.repec.org. For comments please write to the director of NEP, Marco Novarese at <director@nep.repec.org>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the Griffith Business School of Griffith University in Australia.