nep-big 2026-05-04 papers

on Big Data

Issue of 2026–05–04
thirteen papers chosen by
Tom Coupé, University of Canterbury

Dual Interpretation of Machine Learning Forecasts By Maximilian Göbel; Philippe Goulet Coulombe; Karin Klieber
Quantifying Minsky cycles By Ristolainen, Kim
Artificial Intelligence Models for Nowcasting Economic Activity By Jennifer Peña; Katherine Jara; Fernando Sierra
Assessing wage inequality with machine learning: Approaches for measuring the adjusted gender pay gap By Plüghan, Oliver; Rehfeld, Katharina-Maria
Using distributional random forests for the analysis of the income distribution By Martin Biewen; Stefan Glaisner; Simon Zeller
A Blended Data Approach to Measuring Monthly Housing Starts: Satellite Imagery, Survey Data and More! By Nicole Czaplicki; Colin J. Shevlin; Hector R. Ferronato; Aidan D. Smith; Dwarakh V. Nayam; Lei Peng; Scott W. Springer; Doren Walker
The Heterogeneous Earnings Impact of Job Loss Across Workers, Establishments, and Markets By Susan Athey; Lisa K. Simon; Oskar Nordström Skans; Johan Vikström; Yaroslav Yakymovych
Improving disaggregated short-term food inflation forecasts with webscraped data By Christian Beer; Robert Ferstl; Bernhard Graf
An Assessment of the Effects of Monetary Policy Communication in Chile By Mario González-Frugone; Ignacio Rojas
Strategic Reasoning and Sensitivity to Stakes in the Dictator and Ultimatum Games: LLMs vs. Human Proposers By Solomon Polachek; Kenneth Romano; Ozlem Tonguc
Worker Responses to Immigration Across Firms: Evidence from Colombia By Lukas Delgado-Prieto
Same Model, Different Politics? How Language Shapes AI Ideology By Eduardo Levy Yeyati; César M. Ciappa; Milagros Onofri
Road Investment and Violence in DRC: Perishable Peace Dividends By Jevgenijs Steinbuks; Peer Schouten; Mathilde Lebrand; Hannes Mueller

Dual Interpretation of Machine Learning Forecasts

By:	Maximilian Göbel (Brain); Philippe Goulet Coulombe (Université du Québec à Montréal); Karin Klieber (Oesterreichische Nationalbank)
Abstract:	Machine learning predictions are typically interpreted as the sum of contributions of predictors. Yet, each out-of-sample prediction can also be expressed as a linear combination of in-sample values of the predicted variable, with weights corresponding to pairwise proximity scores between current and past economic events. While this dual route leads nowhere in some contexts (e.g., large cross-sectional datasets), it provides sparser interpretations in settings with many regressors and little training data—like macroeconomic forecasting. In this case, the sequence of contributions can be visualized as a time series, allowing analysts to explain predictions as quantifiable combinations of historical analogies. Moreover, the weights can be viewed as those of a data portfolio, inspiring new diagnostic measures such as forecast concentration, short position, and turnover. We show how weights can be retrieved seamlessly for (kernel) ridge regression, random forest, boosted trees, and neural networks. Then, we apply these tools to analyze postpandemic forecasts of inflation, GDP growth, and recession probabilities. In all cases, the approach opens the black box from a new angle and demonstrates how machine learning models leverage history partly repeating itself.
Date:	2025–03–27
URL:	https://d.repec.org/n?u=RePEc:onb:oenbwp:265
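
The dual representation described in the abstract above can be illustrated for plain ridge regression. The following is a minimal sketch under stated assumptions, not the authors' code: the toy data, the penalty `lam`, and the helper names (`dot`, `solve`, `ridge_dual_weights`, `ridge_primal_prediction`) are all hypothetical, and no intercept is fitted. It uses the standard identity that the ridge forecast x'(X'X + lam I)^(-1) X'y equals w'y with w = (XX' + lam I)^(-1) X x, so each forecast is a weighted sum of in-sample values of the target.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def ridge_dual_weights(X, x_new, lam):
    """Weights w such that the ridge forecast at x_new equals sum_i w[i] * y[i]."""
    n = len(X)
    gram = [[dot(X[i], X[j]) + (lam if i == j else 0.0) for j in range(n)]
            for i in range(n)]
    k_new = [dot(row, x_new) for row in X]  # proximity of x_new to each past observation
    return solve(gram, k_new)


def ridge_primal_prediction(X, y, x_new, lam):
    """Usual ridge forecast: fit coefficients, then sum predictor contributions."""
    p = len(x_new)
    xtx = [[sum(r[a] * r[b] for r in X) + (lam if a == b else 0.0) for b in range(p)]
           for a in range(p)]
    xty = [sum(r[a] * yi for r, yi in zip(X, y)) for a in range(p)]
    beta = solve(xtx, xty)
    return dot(beta, x_new)


# Hypothetical toy data: four past observations, two predictors.
X = [[1.0, 0.5], [0.2, -1.0], [-0.7, 0.3], [1.5, 1.1]]
y = [2.0, -1.0, 0.5, 3.0]
x_new = [1.0, 1.0]
lam = 1.0

weights = ridge_dual_weights(X, x_new, lam)
dual_forecast = dot(weights, y)
primal_forecast = ridge_primal_prediction(X, y, x_new, lam)

print("observation weights:", weights)
print("dual forecast:", dual_forecast, "primal forecast:", primal_forecast)
```

The two forecasts agree up to floating-point error. In a time-series setting the entries of `weights` are indexed by date, which is what allows them to be plotted as a series of historical analogies.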

Quantifying Minsky cycles

By:	Ristolainen, Kim
Abstract:	We develop a novel sentiment measure from survey forecasts that captures the component of beliefs arising from the systematic misaggregation of public information relative to a machine benchmark based on the same information set. We extend this sentiment measure historically for a panel of 78 countries using machine learning models trained on BERT embeddings of historical news articles (1903-2020). The backcasted sentiment shows that shocks in median sentiment predict credit booms in the non-tradable corporate sector, which prior research has linked to financial crises. We further find that this sentiment component is shaped by memory-related dynamics, as the time elapsed since major crises and the share of young-to-old people in the population predict surges in optimism even when recent economic developments are controlled for. Taken together, the findings provide new historical evidence consistent with the Minsky-Kindleberger view on financial crises.
Keywords:	Survey data, Sentiment, Memory, Machine Learning, Text Data, Credit growth, Financial Crisis
JEL:	E44 E51 G01 D84 G41 E32
Date:	2026
URL:	https://d.repec.org/n?u=RePEc:zbw:bofrdp:340165

Artificial Intelligence Models for Nowcasting Economic Activity

By:	Jennifer Peña; Katherine Jara; Fernando Sierra
Abstract:	This paper investigates whether artificial intelligence techniques, encompassing both machine learning and deep learning models, can enhance the accuracy of nowcasts for Chile’s monthly economic activity index (IMACEC). The analysis relies on a large and diverse real-time dataset that includes both traditional macroeconomic variables and high-frequency monthly administrative data (from electronic tax records). Three main findings emerge. First, nonlinear models, particularly XGBoost, achieve the lowest root mean squared errors, whereas linear regularized approaches such as SVR and LASSO also show competitive performance. This highlights the value of flexible nonlinear methods and regularized linear approaches when dealing with heterogeneous data. Second, features derived from electronic tax records, such as trade credit volumes and sectoral sales by region, consistently rank among the most important predictors across models. Third, the strongest-performing models (XGBoost, SVR, and LASSO) achieve lower errors than traditional econometric benchmarks, which rely solely on standard macroeconomic aggregates and exclude non-traditional datasets. Overall, the findings show that timely administrative data, combined with AI approaches, can significantly improve economic surveillance and decision-making.
Date:	2025–12
URL:	https://d.repec.org/n?u=RePEc:chb:bcchwp:1058

Assessing wage inequality with machine learning: Approaches for measuring the adjusted gender pay gap

By:	Plüghan, Oliver; Rehfeld, Katharina-Maria
Abstract:	This paper investigates the methodological performance of Ordinary Least Squares (OLS) regression and Random Forest machine learning algorithms in measuring adjusted gender pay gaps. The research is motivated by the European Union's Pay Transparency Directive (2023/970), which mandates that employers report adjusted gender pay gaps. While Oaxaca-Blinder Decomposition and the underlying OLS regression have served as the industry standard for gap estimation, this paper examines whether machine learning approaches can better capture complex, nonlinear compensation relationships. Using synthetic datasets with controlled discrimination parameters, the study compares both methods across two sample sizes and multiple discrimination scenarios. Key findings demonstrate that both methods successfully distinguish between occupational segregation and direct wage discrimination at large sample sizes. However, at smaller sample sizes, Random Forest exhibits substantial instability whereas OLS remains slightly more stable. A methodological adjustment, training Random Forest on the larger population before applying predictions to subsets, substantially improves small-sample performance. The paper concludes that OLS regression remains preferable for formal regulatory compliance due to its interpretability and stability, while Random Forest can serve as a complementary validation tool for large-scale analysis.
Keywords:	Gender Pay Gap, Pay Transparency, OLS Regression, Random Forest, Wage Discrimination, Unexplained Wage Gap, Adjusted Gender Pay Gap
JEL:	J16 J31 J71 M52 C13 C45
Date:	2026
URL:	https://d.repec.org/n?u=RePEc:zbw:iubhhr:340172
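
The distinction the abstract above draws between occupational segregation and direct wage discrimination can be illustrated with a small OLS exercise. This is a hedged sketch, not the paper's design: the synthetic data, the coefficient values, and the names `TRUE_GAP`, `ols` and `solve` are hypothetical. It covers only the OLS side of the comparison, and it omits noise so that the built-in discrimination parameter is recovered exactly.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def ols(X, y):
    """OLS coefficients from the normal equations (X'X) beta = X'y."""
    p = len(X[0])
    xtx = [[sum(r[a] * r[b] for r in X) for b in range(p)] for a in range(p)]
    xty = [sum(r[a] * yi for r, yi in zip(X, y)) for a in range(p)]
    return solve(xtx, xty)


TRUE_GAP = -0.08  # hypothetical direct-discrimination parameter (log points)

rows, log_wages = [], []
for i in range(40):
    female = i % 2
    experience = (i // 2) % 10  # same experience profile for both groups
    # Occupational segregation: half of the men but a quarter of the women are senior.
    senior = 1 if (female == 0 and i % 4 == 0) or (female == 1 and i % 8 == 1) else 0
    rows.append([1.0, float(female), float(experience), float(senior)])
    log_wages.append(3.0 + 0.05 * experience + 0.30 * senior + TRUE_GAP * female)

women = [w for r, w in zip(rows, log_wages) if r[1] == 1.0]
men = [w for r, w in zip(rows, log_wages) if r[1] == 0.0]

raw_gap = sum(women) / len(women) - sum(men) / len(men)  # unadjusted gap
adjusted_gap = ols(rows, log_wages)[1]                   # coefficient on the female dummy

print("raw gap:", raw_gap)
print("adjusted gap:", adjusted_gap)
```

In this toy setup the raw gap mixes the seniority difference with the direct penalty, while the coefficient on the female dummy isolates the penalty once experience and seniority are controlled for.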

Using distributional random forests for the analysis of the income distribution

By:	Martin Biewen; Stefan Glaisner; Simon Zeller
Abstract:	This paper explores distributional random forests as a flexible machine learning method for analysing income distributions. Distributional random forests avoid parametric assumptions, capture complex interactions among covariates, and, once trained, provide full estimates of conditional income distributions. From these, any type of distributional index such as measures of location, inequality and poverty risk can be readily computed. They can also efficiently process grouped income data and be used as inputs for distributional decomposition methods. We consider four types of applications: (i) estimating income distributions for granular population subgroups, (ii) analysing distributional change over time, (iii) small-area estimation of income distributions, and (iv) purging spatial income distributions of differences in spatial characteristics. Our application based on the German Microcensus provides new results on the socio-economic and spatial structure of the German income distribution.
Keywords:	inequality, poverty, small-area estimation, grouped income data
JEL:	D31 I32
Date:	2026–02
URL:	https://d.repec.org/n?u=RePEc:crm:wpaper:26051

A Blended Data Approach to Measuring Monthly Housing Starts: Satellite Imagery, Survey Data and More!

By:	Nicole Czaplicki; Colin J. Shevlin; Hector R. Ferronato; Aidan D. Smith; Dwarakh V. Nayam; Lei Peng; Scott W. Springer; Doren Walker
Abstract:	As part of the comprehensive Construction Re-engineering Initiative at the U.S. Census Bureau, alternative data sources are being considered to supplement or replace current data collection methods. For the Survey of Construction (SOC), which measures new residential construction, this includes observing housing starts from satellite imagery in place of the current interviews for housing starts conducted by field representatives. Satellite images are obtained monthly for a subset of places in the SOC sample. Convolutional neural network models are then applied to images to predict likely new residential construction projects, with the current focus being single-family housing starts. Several post-prediction processing steps are applied, including exclusions based on intersections with known buildings or roads, treatments for missing data due to cloud cover, and adjustments for the length of time between consecutive images, to ultimately produce place-level estimates of housing starts. These place-level estimates are then combined with the existing building permit level survey data to produce estimates of West South Central division-level housing starts, an experimental data product from the Census Bureau.
JEL:	C45 C8 C80
Date:	2026–04
URL:	https://d.repec.org/n?u=RePEc:nbr:nberwo:35113

The Heterogeneous Earnings Impact of Job Loss Across Workers, Establishments, and Markets

By:	Susan Athey; Lisa K. Simon; Oskar Nordström Skans; Johan Vikström; Yaroslav Yakymovych
Abstract:	Using rich Swedish administrative data, we apply causal machine learning methods to study how earnings losses after job displacement vary with observable characteristics that may be relevant for targeting policy interventions for workers. Heterogeneity in effects is as large within as across worker groups defined by age and schooling, and as large within as across establishments. A substantial portion of cross-establishment heterogeneity can be explained by industry and local labor market characteristics, suggesting a role for place- and industry-based targeting. The largest losses are concentrated among already vulnerable workers, indicating that well-designed targeting policies can improve both efficiency and equity.
Keywords:	Plant closures, heterogeneous effects, GRF
JEL:	J65 J21 J31 C45
Date:	2026–03
URL:	https://d.repec.org/n?u=RePEc:crm:wpaper:26075

Improving disaggregated short-term food inflation forecasts with webscraped data

By:	Christian Beer (Oesterreichische Nationalbank, Economic Analysis Division); Robert Ferstl (Off-Site Banking Analysis and Strategy Division); Bernhard Graf
Abstract:	This study examines the effectiveness of using webscraped data to predict price developments in the Austrian food retail sector. We calculate monthly nowcasts of price changes based on daily price data collected by the OeNB since mid-2020, using Eurostat methodology for price index calculation, along with further details provided by the national statistics office. We assess the quality of our nowcasts by comparing them with various baseline models and more advanced time series methods also covering machine learning approaches. Our findings indicate that webscraped data are a useful way to obtain more accurate nowcasts with a time advantage, amounting to several weeks, over traditional data sources. In addition, we are the first, to our knowledge, to explore the possibility of using the improved accuracy of the nowcasts as a basis for disaggregated short-term forecasts that extend up to one quarter. While direct forecasts at higher levels of aggregation produce slightly more accurate overall metrics, indirect forecasts derived from disaggregated data provide superior insights into the underlying dynamics of specific sub-components. Our results show that more advanced time series models have trade-offs in terms of computational efficiency while performing very similarly to more traditional methods. These findings have implications for policymakers who aim to develop an effective system for real-time monitoring of inflation dynamics at a very granular level.
Keywords:	Webscraping, Inflation forecasting, Time series models
JEL:	C22 C81 E31 E37
Date:	2025–01–16
URL:	https://d.repec.org/n?u=RePEc:onb:oenbwp:262

An Assessment of the Effects of Monetary Policy Communication in Chile

By:	Mario González-Frugone; Ignacio Rojas
Abstract:	In recent decades, central banks have increasingly relied on communication as a policy tool. We use linguistic methods to extract the latent information from monetary policy documents in Spanish of the Central Bank of Chile and use this information to reassess the impact of monetary policy surprises on financial markets. As a by-product of this analysis, we present a methodology for analyzing central bank documents in Spanish, construct a sentiment index that captures the policy tilt of each document, and examine whether these documents provide information that can help anticipate changes in the monetary policy rate. The sentiment index is categorized into key economic topics—Inflation, Activity, External Conditions, Financial Conditions, Expectations, and Risk—enabling a detailed understanding of the dynamics behind policy bias. Our findings reveal that monetary policy rate (MPR) surprises have a strong and immediate effect on the yield curve. However, this impact is short-lived and diminishes along the yield curve. In contrast, sentiment surprises in press releases exhibit a weaker, but more persistent effect across instruments. Regarding the minutes, our results suggest that the information they contain is generally already priced in, as sentiment surprises from these documents do not significantly affect the yield curve. Conversely, surprises in the Monetary Policy Report (IPoM) have a positive effect on two-year interest rates, indicating that these reports provide new information that shapes medium-term monetary policy expectations. In terms of anticipation, we find that Central Bank policy documents provide enough information to anticipate policy rate movements.
Date:	2025–08
URL:	https://d.repec.org/n?u=RePEc:chb:bcchwp:1053

Strategic Reasoning and Sensitivity to Stakes in the Dictator and Ultimatum Games: LLMs vs. Human Proposers

By:	Solomon Polachek; Kenneth Romano; Ozlem Tonguc
Abstract:	This study examines how large language models (LLMs) respond to varying stake sizes in the Dictator and Ultimatum games using the high-stakes design introduced by Andersen et al. (2011). We test ten leading LLMs chosen for their accessibility, prominence, and differences in reasoning capabilities. Results reveal substantial variation across models: Only 5 of 10 models exhibit strategic behavior by offering more in the Ultimatum Game (UG) than in the Dictator Game (DG). Relative to humans, 4 models are consistently more generous, 2 consistently less, and 4 vary with stake size. Only 1 model shows a monotonic decline in UG offers as stakes increase; the remaining 9 are non-monotonic or stable. Unlike humans, most models reduce UG offers when endowed with wealth. Prompting for "human-like" decisions generally increases generosity in the UG. These findings are important for evaluating whether LLMs can serve as realistic proxies for human subjects in behavioral experiments and highlight key limitations and future directions for model development.
Keywords:	Ultimatum Game, Dictator Game, fairness, payoff stakes, artificial intelligence
JEL:	D01 C72 C90
Date:	2026–04
URL:	https://d.repec.org/n?u=RePEc:crm:wpaper:26110

Worker Responses to Immigration Across Firms: Evidence from Colombia

By:	Lukas Delgado-Prieto
Abstract:	The labor market effects of immigration depend on how firms adjust, yet this aspect remains unexplored in developing countries. This paper studies the mass influx of Venezuelan migrants into Colombia using employer-employee data. As immigrants concentrate in informal employment, formal employment for minimum-wage natives falls, reflecting their substitutability with lower-cost informal workers. The negative effects are stronger in small formal firms, which rely more on informality. A machine learning analysis shows that firm-level factors explain more of the heterogeneity in worker-level impacts. These findings highlight that informality amplifies firms' role in shaping workers' immigration adjustments.
Keywords:	Immigration, Minimum wages, Formal labor markets, Causal forest
JEL:	F22 O15 O17 R23
Date:	2026–02
URL:	https://d.repec.org/n?u=RePEc:crm:wpaper:26041

Same Model, Different Politics? How Language Shapes AI Ideology

By:	Eduardo Levy Yeyati; César M. Ciappa; Milagros Onofri
Abstract:	Recent work measures ideological positioning and drift in large language models (LLMs), but typically assumes that those measurements are invariant to the language of evaluation. This paper tests that assumption using the full Political Compass questionnaire in English and Spanish across three generations of OpenAI models, together with a benchmark comparison against recent Qwen and Mistral releases. Using matched item-level responses, we estimate within-model Spanish–English displacement and assess how language choice affects cross-model comparisons. We find that measured ideological coordinates remain in the same broad region across languages, but are not language-invariant. Spanish–English shifts differ in sign and magnitude across models and axes, and in several cases amount to a substantial share of the inter-model dispersion typically interpreted as ideological drift in English-only audits. The implication is methodological: ideological drift should not be treated as a language-invariant property of a model, but as a measurement outcome conditional on language choice and instrument design. Multilingual audits should therefore report language-specific placements and within-model cross-language displacement rather than extrapolating from English-only measurements.
Keywords:	large language models, ideological drift, multilingual evaluation, Political Compass, language dependence
JEL:	C83 C90 C18 D72
Date:	2026–04
URL:	https://d.repec.org/n?u=RePEc:udt:wpgobi:wp_gob_2026_07

Road Investment and Violence in DRC: Perishable Peace Dividends

By:	Jevgenijs Steinbuks; Peer Schouten; Mathilde Lebrand; Hannes Mueller
Abstract:	This paper explores the effect of road rehabilitation on violent conflict using a novel, rich dataset of road rehabilitation projects in the Democratic Republic of Congo. The country received massive external investments in transport infrastructure rehabilitation under conditions of endemic conflict, often with the explicit objective of supporting peacebuilding objectives. The paper finds that investments in road rehabilitation deter violence, which decreases significantly by around 5 to 10 percentage points after the completion of road rehabilitation. However, another significant finding, based on large-scale machine learning analysis of remote sensing data of road quality over time, is that the peace dividend of infrastructure investments is perishable: violence increases again as roads progressively deteriorate. Improved durability and systematic maintenance of roads are thus necessary to extend the "peace dividend" of road investments.
Keywords:	DRC, mining, remote sensing, road infrastructure, violence
JEL:	O18 O19 O55 Q34
Date:	2026–04
URL:	https://d.repec.org/n?u=RePEc:bge:wpaper:1574

This nep-big issue is ©2026 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.

General information on the NEP project can be found at https://nep.repec.org. For comments please write to the director of NEP, Marco Novarese at <director@nep.repec.org>. Put “NEP” in the subject, otherwise your mail may be rejected.

NEP’s infrastructure is sponsored by the Griffith Business School of Griffith University in Australia.