nep-big New Economics Papers
on Big Data
Issue of 2026–04–13
six papers chosen by
Tom Coupé, University of Canterbury


  1. Returns to Education in the United States: A Comparison of OLS and Double Machine Learning Methods By Helal, Al Mansor; Hiraki, Ryotaro; Patrinos, Harry Anthony
  2. School Governance and Learner Performance in Sub-Saharan Africa: A Neural Networks Approach By Sylvain K. Assienin; Auguste K. Kouakou; Christian K. Nda; Loukou L. E. Yobouet
  3. Validating Large Language Model Annotations By Anne Lundgaard Hansen
  4. Machine Learning Approaches for Improving Demand Forecasting Accuracy in Retail Supply Chains By Abdelfatah, Omar Sharafeldin Mohamed
  5. Debiasing LLMs by Fine-tuning By Zhenyu Gao; Wenxi Jiang; Yutong Yan
  6. Effect of Cigarette Price and Tax Increases on Smoking in Europe: A Difference-in-Differences Study with Double Machine Learning By Andreas Stoller; Martin Huber

  1. By: Helal, Al Mansor; Hiraki, Ryotaro; Patrinos, Harry Anthony
    Abstract: This study examines the economic returns to education in the U.S. using 2024 CPS data and compares Ordinary Least Squares (OLS) regression with a Double Machine Learning (DML) framework incorporating models such as random forests, boosted trees, lasso, GAMs, and neural networks (MLP). Results show consistent returns of 8 to 9 percent per additional year of schooling across methods. Simulations reveal that all predictors perform well under linear assumptions if hyperparameters are optimally adjusted, while OLS/Lasso suffer from nonlinearity. Findings suggest that OLS remains robust in low-dimensional, near-linear contexts, offering practical guidance for economists and policymakers balancing model complexity and interpretability in education research.
    Keywords: Returns to education, Machine learning
    JEL: I20 J31 J24 D62 O15
    Date: 2026
    URL: https://d.repec.org/n?u=RePEc:zbw:glodps:1733
  2. By: Sylvain K. Assienin (UJloG - Université Jean Lorougnon Guédé); Auguste K. Kouakou (UJloG - Université Jean Lorougnon Guédé); Christian K. Nda (UJloG - Université Jean Lorougnon Guédé); Loukou L. E. Yobouet (Université Alassane Ouattara [Bouaké, Côte d'Ivoire])
    Abstract: The aim of this paper is to analyse the impact of school governance on learner performance in Sub-Saharan Africa, in the face of persistent low performance in the region, revealed by the PASEC 2019 report. The study uses an econometric model followed by machine learning models (Logistic Regression, Random Forest, Extra Trees Classifier, Extreme Gradient Boosting, Artificial Neural Networks) to explore the relationships between school results and governance factors measured by school management, pedagogical practices and relations with stakeholders. The results show that artificial neural network models perform better than conventional approaches in terms of accuracy and explainability. Explainability by Shapley values shows that the quality of administrative and pedagogical management, benevolent school-student relations, and activities to promote the best students significantly improve performance. The study suggests capacity building for managers in order to improve the quality of administrative and pedagogical management. It also highlights the need to promote rigorous administrative governance, based on effective practices and adapted to local realities. In addition, specific strategies should be put in place to reward high-performing students, while encouraging professional collaboration between education stakeholders. Finally, a review of parental involvement practices is recommended in order to avoid inappropriate expectations likely to be detrimental to learners' performance.
    Keywords: Shapley values, neural networks, school performance, school governance
    Date: 2025–08–31
    URL: https://d.repec.org/n?u=RePEc:hal:journl:hal-05547822
  3. By: Anne Lundgaard Hansen
    Abstract: This paper proposes a validation framework for LLM-generated measurements when reliable benchmarks are unavailable. Validity is established by testing whether an LLM can reconstruct passages from annotated labels while maintaining semantic consistency with the original text. The framework avoids circular reasoning by establishing testable prerequisite properties that must be met for a validation to be considered successful. Application to news article data demonstrates that the framework serves as a practical alternative to human benchmarking, which offers advantages in objectivity, scalability, and cost-effectiveness while identifying cases where LLMs capture economic meaning that human evaluators miss.
    Keywords: Economic measurement; Machine learning; Unstructured data; Sentiment; Computational techniques
    JEL: C18 C45 C80
    Date: 2026–03–30
    URL: https://d.repec.org/n?u=RePEc:fip:fedgfe:103001
  4. By: Abdelfatah, Omar Sharafeldin Mohamed
    Abstract: Accurate demand forecasting remains one of the most critical yet persistently challenging functions in retail supply chain management. Traditional statistical forecasting methods such as ARIMA and exponential smoothing have long served as industry standards; however, their limited capacity to capture nonlinear demand patterns, seasonal volatility, and external market signals has prompted growing interest in machine learning (ML) alternatives. This study investigates the comparative effectiveness of multiple ML approaches, including Random Forest, Gradient Boosting (XGBoost), Long Short-Term Memory (LSTM) neural networks, and hybrid ensemble models, against traditional baseline methods in the context of retail supply chain demand forecasting. Employing a quantitative research design, the study utilizes a panel dataset comprising 36 months of point-of-sale (POS) transaction records, promotional calendars, macroeconomic indicators, and weather data from 14 retail organizations operating across grocery, fashion, and consumer electronics segments. Forecasting accuracy is evaluated using Mean Absolute Percentage Error (MAPE), Root Mean Square Error (RMSE), and Forecast Bias metrics across multiple product categories and forecasting horizons (1-week, 4-week, and 12-week ahead). Results demonstrate that ensemble ML models, particularly hybrid LSTM-XGBoost architectures, achieve statistically significant improvements in forecasting accuracy over traditional methods, with MAPE reductions averaging 28.6% at the 4-week horizon. Feature importance analysis identifies promotional activity, competitor pricing signals, and lagged POS data as the most influential demand drivers. The study further reveals that ML forecasting benefits are heterogeneous across product categories, with highest gains observed in high-velocity, promotion-sensitive SKUs and smallest gains in slow-moving, low-volatility items. A practical implementation framework is proposed, offering retail supply chain practitioners a structured pathway from data readiness assessment through model deployment and ongoing performance monitoring.
    Date: 2026–04–03
    URL: https://d.repec.org/n?u=RePEc:osf:socarx:4z9be_v1
  5. By: Zhenyu Gao; Wenxi Jiang; Yutong Yan
    Abstract: Prior research shows that large language models (LLMs) exhibit systematic extrapolation bias when forming predictions from both experimental and real-world data, and that prompt-based approaches appear limited in alleviating this bias. We propose a supervised fine-tuning (SFT) approach that uses Low-Rank Adaptation (LoRA) to train off-the-shelf LLMs on instruction datasets constructed from rational benchmark forecasts. By intervening at the parameter level, SFT changes how LLMs map observed information into forecasts and thereby mitigates extrapolation bias. We evaluate the fine-tuned model in two settings: controlled forecasting experiments and cross-sectional stock return prediction. In both settings, fine-tuning corrects the extrapolative bias out-of-sample, establishing a low-cost and generalizable method for debiasing LLMs.
    Date: 2026–04
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2604.02921
  6. By: Andreas Stoller; Martin Huber
    Abstract: We estimate the effect of cigarette price and tax increases on smoking rates using Eurobarometer survey data from 27 European Union countries between 2012 and 2020. Following a difference-in-differences approach, we compare individuals exposed to large price and tax increases with those in stable price and tax environments. Estimation is based on a difference-in-differences estimator with double machine learning, which relaxes the functional form assumptions typically imposed by parametric approaches such as two-way fixed effects. Our results indicate that tax increases reduce smoking rates among individuals who smoke at least once per month and among daily smokers. The reduction is primarily driven by individuals aged 15-24. We examine the sensitivity of our findings to functional form assumptions and treatment definitions. While estimates are robust to alternative functional form assumptions, they are sensitive to whether the treatment is defined as binary or continuous.
    Date: 2026–04
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2604.05841
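
Papers 1 and 6 in this issue both rely on double machine learning. For readers unfamiliar with the idea, the following is a minimal sketch of the cross-fitted "partialling-out" estimator for a partially linear model, not the code used in either paper. Everything in it is an assumption made for illustration: the simulated data, the true coefficient of 0.5, the single covariate, and the polynomial ridge regression standing in for the nuisance learners (the papers use learners such as random forests, boosted trees, and lasso). The function names are hypothetical.

```python
import numpy as np

def poly_features(x, degree=3):
    # Design matrix with columns 1, x, x^2, ..., x^degree for a 1-D covariate.
    return np.vander(x, N=degree + 1, increasing=True)

def fit_predict_ridge(x_train, y_train, x_test, alpha=1e-3):
    # Stand-in nuisance learner: ridge regression on polynomial features.
    a = poly_features(x_train)
    coef = np.linalg.solve(a.T @ a + alpha * np.eye(a.shape[1]), a.T @ y_train)
    return poly_features(x_test) @ coef

def dml_plr(y, d, x, n_folds=5, seed=0):
    # Cross-fitting: nuisance functions are fitted on the other folds and
    # used to residualize the held-out fold, so no observation is
    # residualized by a model that was trained on it.
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    y_res = np.empty(len(y))
    d_res = np.empty(len(y))
    for test_idx in folds:
        train = np.ones(len(y), dtype=bool)
        train[test_idx] = False
        y_res[test_idx] = y[test_idx] - fit_predict_ridge(x[train], y[train], x[test_idx])
        d_res[test_idx] = d[test_idx] - fit_predict_ridge(x[train], d[train], x[test_idx])
    # Final stage: regress the outcome residual on the treatment residual.
    theta = (d_res @ y_res) / (d_res @ d_res)
    # Standard error from the estimated score function.
    psi = (y_res - theta * d_res) * d_res
    se = np.sqrt(np.mean(psi ** 2) / len(y)) / np.mean(d_res ** 2)
    return theta, se

# Simulated example (hypothetical data): the covariate x affects both the
# treatment d and the outcome y nonlinearly, so x confounds the effect of d.
rng = np.random.default_rng(42)
n = 2000
x = rng.uniform(-2.0, 2.0, n)
d = 0.5 * x ** 2 + rng.normal(size=n)
y = 0.5 * d + np.sin(x) + x ** 2 + rng.normal(size=n)

theta_hat, se_hat = dml_plr(y, d, x)
naive = np.cov(y, d)[0, 1] / np.var(d, ddof=1)  # regression of y on d without controls
print(f"DML estimate: {theta_hat:.3f} (se {se_hat:.3f}); no-controls estimate: {naive:.3f}")
```

In this simulated setup, the regression without controls absorbs the effect of x and overstates the coefficient, while the residual-on-residual step should recover a value close to 0.5. Swapping `fit_predict_ridge` for a more flexible learner changes only the nuisance step; the cross-fitting and final regression stay the same.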

This nep-big issue is ©2026 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at https://nep.repec.org. For comments please write to the director of NEP, Marco Novarese at <director@nep.repec.org>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the Griffith Business School of Griffith University in Australia.