nep-big New Economics Papers
on Big Data
Issue of 2026–03–16
eight papers chosen by
Tom Coupé, University of Canterbury


  1. AI-Powered Skill Classification: Mapping Technology Intensity in the German Labor Market By Grenz, Sabrina; Gregory, Terry; Lehmer, Florian
  2. RAUI: Uncertainty Indicators Built With Artificial Intelligence By Morteza Ghomi; Samuel Hurtado
  3. Beyond Polarity: Multi-Dimensional LLM Sentiment Signals for WTI Crude Oil Futures Return Prediction By Dehao Dai; Ding Ma; Dou Liu; Kerui Geng; Yiqing Wang
  4. Double Machine Learning for Time Series By Milos Ciganovic; Federico D'Amario; Massimiliano Tancioni
  5. The Content Moderator’s Dilemma: Removal of Toxic Content and Distortions to Online Discourse By Habibi, Mahyar; Hovy, Dirk; Schwarz, Carlo
  6. Who Shirks at Work? An Application of Machine Learning to Time Use Data By Giménez-Nadal, José Ignacio; Molina, José Alberto; Velilla, Jorge
  7. DatedGPT: Preventing Lookahead Bias in Large Language Models with Time-Aware Pretraining By Yutong Yan; Raphael Tang; Zhenyu Gao; Wenxi Jiang; Yao Lu
  8. Banning Mobile Phones in Schools: A Comprehensive Analysis of Media Coverage Across Countries By Kharazi Sopho; Sala Arianna; Bostelmann Gerrit; Kotseva Bonka

  1. By: Grenz, Sabrina (Utrecht University); Gregory, Terry (LISER); Lehmer, Florian (IAB Nuremberg)
    Abstract: The rapid evolution of technology is reshaping labor markets by altering skill demands and job profiles. This paper introduces a novel skill-based measure of occupational technology intensity -- the Occupational Technology Skill Share (OTSS) -- that distinguishes between manual, digital, and frontier technologies. Using natural language processing, generative AI, and supervised machine learning, we develop an AI-powered skill classification that enriches occupation-linked skill labels with standardized GenAI-generated descriptions and structured indicators of technological content, enabling transparent classification by technology intensity. We compute OTSS for all occupations in the German labor market. For the average worker in 2023, manual technologies account for the largest share of skill content (42%), followed by digital (38%) and frontier technologies (20%). Frontier technologies remain concentrated in specialized occupations, while digital technologies are widespread. Linking these measures to administrative data from 2012–2023 shows a broad shift from manual and digital toward frontier skills across occupations, and reveals a U-shaped relationship between changes in frontier skill intensity and employment growth.
    Keywords: artificial intelligence, digitalization, skills, employment growth
    JEL: J21 J24 O33
    Date: 2026–03
    URL: https://d.repec.org/n?u=RePEc:iza:izadps:dp18415
  2. By: Morteza Ghomi (BANCO DE ESPAÑA); Samuel Hurtado (BANCO DE ESPAÑA)
    Abstract: We present a methodology for generating uncertainty indicators for user-defined topics based on newspaper data. The approach is based on Retrieval-Augmented Generation (RAG) systems commonly used in artificial intelligence applications, which we adapt to construct topic-specific uncertainty measures, referred to as Retrieval-Augmented Uncertainty Indicators (RAUI). The method employs semantic search with an embedding model to select news articles relevant to a given topic, and a large language model (LLM) to quantify the level of uncertainty contained in each of those articles. We construct uncertainty indicators for ten topics using Spanish newspaper data and an aggregate measure that also highlights how each topic contributes to overall uncertainty. We present two practical applications of these indicators: a VAR analysis that shows how different sources of uncertainty have different effects on the Spanish economy, and an estimation that generates time-varying fan charts around the Banco de España GDP growth projections.
    Keywords: uncertainty, artificial intelligence, natural language processing, newspapers
    JEL: C81 E32
    Date: 2026–03
    URL: https://d.repec.org/n?u=RePEc:bde:wpaper:2609
  3. By: Dehao Dai; Ding Ma; Dou Liu; Kerui Geng; Yiqing Wang
    Abstract: Forecasting crude oil prices remains challenging because market-relevant information is embedded in large volumes of unstructured news and is not fully captured by traditional polarity-based sentiment measures. This paper examines whether multi-dimensional sentiment signals extracted by large language models improve the prediction of weekly WTI crude oil futures returns. Using energy-sector news articles from 2020 to 2025, we construct five sentiment dimensions covering relevance, polarity, intensity, uncertainty, and forwardness based on GPT-4o, Llama 3.2-3b, and two benchmark models, FinBERT and AlphaVantage. We aggregate article-level signals to the weekly level and evaluate their predictive performance in a classification framework. The best results are achieved by combining GPT-4o and FinBERT, suggesting that LLM-based and conventional financial sentiment models provide complementary predictive information. SHAP analysis further shows that intensity- and uncertainty-related features are among the most important predictors, indicating that the predictive value of news sentiment extends beyond simple polarity. Overall, the results suggest that multi-dimensional LLM-based sentiment measures can improve commodity return forecasting and support energy-market risk monitoring.
    Date: 2026–03
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2603.11408
  4. By: Milos Ciganovic; Federico D'Amario; Massimiliano Tancioni
    Abstract: We modify the Double Machine Learning estimator to broaden its applicability to macroeconomic time-series settings. A deterministic cross-fitting step, termed Reverse Cross-Fitting, leverages the time-reversibility of stationary series to improve sample utilization and efficiency. We detail and prove the conditions under which the estimator is asymptotically valid. We then demonstrate, through simulations, that its performance remains valid in realistic finite samples and is robust to model misspecification and violations of assumptions, such as heteroskedasticity. In high dimensions, predictive metrics for tuning nuisance learners do not generally minimize bias in the causal score. We propose a calibration rule targeting a "Goldilocks zone", a region of tuning parameters that delivers stable, partialled-out signals and reduced small-sample bias. Finally, we apply our procedure to residualized Local Projections to estimate the dynamic effects of a rise in Tier 1 regulatory capital. The results underscore the usefulness of the methodology for inference in macroeconomic applications.
    Date: 2026–03
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2603.10999
  5. By: Habibi, Mahyar (Department of Economics, Bocconi University); Hovy, Dirk (Department of Computing Sciences, Bocconi University); Schwarz, Carlo (Department of Economics, Bocconi University)
    Abstract: There is an ongoing debate about how to moderate toxic speech on social media and the impact of content moderation on online discourse. This paper proposes and validates a methodology for measuring the content-moderation-induced distortions in online discourse using text embeddings from computational linguistics. Applying the method to a representative sample of 5 million US political Tweets, we find that removing toxic Tweets significantly alters the semantic composition of content. The magnitudes of the distortions are comparable to removing 4 out of 67 topics from the online discourse at random. This finding is consistent across different embedding models, toxicity metrics, and samples. Importantly, we demonstrate that these effects are not solely driven by toxic language but by the removal of topics often expressed in toxic form. We propose an alternative approach to content moderation that uses generative Large Language Models to rephrase toxic Tweets, preserving their salvageable content rather than removing them entirely. We show that this rephrasing strategy reduces toxicity while mitigating distortions in online content.
    Keywords: social media, content moderation, content distortions, toxicity, embeddings
    Date: 2026
    URL: https://d.repec.org/n?u=RePEc:cge:wacage:793
  6. By: Giménez-Nadal, José Ignacio (University of Zaragoza); Molina, José Alberto (University of Zaragoza); Velilla, Jorge (University of Zaragoza)
    Abstract: Worker productivity depends not only on hours worked, but also on how work time is actually used, and time-use evidence shows that non-work at work is non-trivial. This paper provides a data-driven characterization of shirking, and studies which observable characteristics best predict shirking behavior using American Time Use Survey data over 2003–2024. We implement a machine-learning forward selection procedure based on out-of-sample predictive performance. Our results suggest that shirking strongly depends on stochastic or unobserved factors, and that the determinants of the extensive and intensive margins are different. Moreover, the most informative predictors are predominantly job-related and time-allocation variables, whereas macro and labor-market indicators seem less relevant. This suggests that policies or managerial approaches to improve worker efficiency relying on observables face important limitations.
    Keywords: shirking, non-work at work, ATUS data, prediction
    JEL: J22 C53
    Date: 2026–03
    URL: https://d.repec.org/n?u=RePEc:iza:izadps:dp18432
  7. By: Yutong Yan; Raphael Tang; Zhenyu Gao; Wenxi Jiang; Yao Lu
    Abstract: In financial backtesting, large language models pretrained on internet-scale data risk introducing lookahead bias that undermines their forecasting validity, as they may have already seen the true outcome during training. To address this, we present DatedGPT, a family of twelve 1.3B-parameter language models, each trained from scratch on approximately 100 billion tokens of temporally partitioned data with strict annual cutoffs spanning 2013 to 2024. We further enhance each model with instruction fine-tuning on both general-domain and finance-specific datasets curated to respect the same temporal boundaries. Perplexity-based probing confirms that each model's knowledge is effectively bounded by its data cutoff year, while evaluation on standard benchmarks shows competitive performance with existing models of similar scale. We provide an interactive web demo that allows users to query and compare responses from models across different cutoff years.
    Date: 2026–03
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2603.11838
  8. By: Kharazi Sopho; Sala Arianna (European Commission - JRC); Bostelmann Gerrit; Kotseva Bonka
    Abstract: This study presents a comprehensive analysis of the media coverage surrounding the topic of mobile phone bans in schools. The report encompasses a thorough quantitative examination of a large dataset comprising over 21,000 articles from both mainstream and unverified media sources. The research investigates the overall trends in reporting during the specified timeframe, as well as the timeline distribution and heatmap intensities of framing dimensions and persuasion techniques per source country and across the top 30 clusters. Furthermore, the top shared articles from unverified sources on Facebook are analysed, providing insight into the role of social media in shaping public opinion on mobile phone bans in educational settings. The findings of this study have significant implications for policymakers, highlighting the complexities of public discourse and the influence of media on opinions regarding mobile phone regulation in schools. By contributing to a deeper understanding of the mobile phone ban debate, our research informs the development of effective policies to promote a healthier and more focused learning environment for students.
    Date: 2026–02
    URL: https://d.repec.org/n?u=RePEc:ipt:iptwpa:jrc143863

This nep-big issue is ©2026 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at https://nep.repec.org. For comments please write to the director of NEP, Marco Novarese at <director@nep.repec.org>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the Griffith Business School of Griffith University in Australia.