nep-big New Economics Papers
on Big Data
Issue of 2026–02–23
five papers chosen by
Tom Coupé, University of Canterbury


  1. Optimal Audit Targeting with Machine Learning: Evidence from Pakistan By Nicholas Lacoste; Zehra Farooq
  2. Predicting Corporate ESG Scores from Financial Performance and Environmental Indicators: A Machine Learning Framework By Chouech, Olfa
  3. Forecasting European Sovereign Spreads using Machine Learning By Bouillot, Roland; Candelon, Bertrand; Kool, Clemens
  4. Predicting Well-Being with Mobile Phone Data: Evidence from Four Countries By M. Merritt Smith; Emily Aiken; Joshua E. Blumenstock; Sveta Milusheva
  5. Keep it simple, stupid!: The determinants of language complexity in politicians' parliamentary and online communication By Kittel, Rebecca; Silva, Bruno Castanho

  1. By: Nicholas Lacoste (Tulane University); Zehra Farooq (Federal Board of Revenue, Pakistan)
    Abstract: This paper bridges welfare economics and machine learning econometrics to develop empirically implementable algorithms for optimal audit targeting. We derive a sufficient statistic-based targeting algorithm that depends on three individualized causal effects: the immediate revenue recovered from an audit, the causal effect of an audit on long-run tax revenue, and the marginal administrative cost of an audit. We estimate these effects with a variety of machine learners, comparing causal forests, LASSO, gradient boosted trees, and neural networks, using the universe of Pakistani income tax returns and exploiting years in which audits were assigned completely at random. We implement our targeting algorithms in out-of-bag years, comparing them to the real-world policy when audits were partially or entirely targeted. We show that the real-world audit program in Pakistan lost almost 173,000 Rs ($1,700) in net revenue per audit, while our optimal policy generates 285,000 Rs ($2,800) in expected net revenue per audit. We also find that targeting audits based on immediate recoup is sub-optimal relative to targeting on long-run deterrence in this setting. Moving forward, our framework offers a general approach to empirical welfare maximization using machine learning in resource-constrained policy settings.
    Keywords: optimal audit policy, tax enforcement, machine learning, sufficient statistics
    JEL: H21 H26 C14 C45
    Date: 2026–02
    URL: https://d.repec.org/n?u=RePEc:tul:wpaper:2603
  2. By: Chouech, Olfa
    Abstract: As investors, regulators, and the public increasingly emphasize sustainable investment amid growing climate concerns, the accurate prediction of Environmental, Social, and Governance (ESG) metrics has become a crucial complement to traditional assessment methods. This study analyzes 1,000 companies across nine industries and seven regions between 2015 and 2025 to predict overall ESG scores using key financial and environmental indicators. To ensure robust predictive performance, a diverse set of machine learning algorithms—including Linear Regression, Random Forests, and four boosting models (AdaBoost, LightGBM, XGBoost, and CatBoost)—was employed. To address potential bias in panel data, a panel-aware machine learning framework incorporating GroupKFold cross-validation was implemented. The results show that boosting algorithms consistently outperform traditional linear approaches in predicting ESG scores. Among them, CatBoost achieved the best overall performance, with the lowest RMSE (4.608), MAE (2.222), and MSE (21.234), and the highest R² (0.913), indicating strong predictive accuracy. Overall, this study presents an innovative and transferable framework for predicting ESG scores, thus contributing to both empirical research and quantitative modeling practices. Furthermore, it advances the sustainability field by providing a machine learning–based application that enables companies to predict their ESG scores in real time.
    Keywords: ESG, Machine Learning, Boosting Algorithms, Sustainable Development, Predictive Modeling
    JEL: O32 Q55 Q56
    Date: 2025–09–01
    URL: https://d.repec.org/n?u=RePEc:pra:mprapa:127272
  3. By: Bouillot, Roland (Maastricht University); Candelon, Bertrand (Université catholique de Louvain, LIDAM/LFIN, Belgium); Kool, Clemens (Maastricht University)
    Abstract: Accurate forecasting constitutes a central objective for policymakers. This paper examines the application of advanced machine-learning techniques to predict the 10-year sovereign bond spreads vis-à-vis the German bund, employing a novel high-dimensional dataset covering 10 European countries over the period 2007–2025. An exhaustive comparison of predictive performance, both in-sample and out-of-sample, demonstrates that XGBoost delivers the highest degree of accuracy. Building on these forecasts, we construct fragmentation matrices that capture the extent of asymmetry across Euro area sovereign bond markets. Prior to the COVID-19 crisis, results confirm the well-documented clustering between core and peripheral countries. However, since 2021 this segmentation appears to have weakened, as French and Belgian spreads exhibit a synchronous trajectory. These findings contribute to the literature on financial integration and fragmentation within the Euro area, offering new insights into the evolving dynamics of sovereign bond markets.
    Keywords: Machine learning ; Financial fragmentation risk ; XGBoost ; Sovereign spreads
    Date: 2025–11–30
    URL: https://d.repec.org/n?u=RePEc:ajf:louvlf:2025004
  4. By: M. Merritt Smith; Emily Aiken; Joshua E. Blumenstock; Sveta Milusheva
    Abstract: We provide systematic evidence on the potential for estimating household well-being from mobile phone data. Using data from four countries - Afghanistan, Cote d'Ivoire, Malawi, and Togo - we conduct parallel, standardized machine learning experiments to assess which measures of welfare can be most accurately predicted, which types of phone data are most useful, and how much training data is required. We find that long-term poverty measures such as wealth indices (Pearson's rho = 0.20-0.59) and multidimensional poverty (rho = 0.29-0.57) can be predicted more accurately than consumption (rho = 0.04-0.54); transient vulnerability measures like food security and mental health are very difficult to predict. Models using calls and text message behavior are more predictive than those using metadata on mobile internet usage, mobile money transactions, and airtime top-ups. Predictive accuracy improves rapidly through the first 1,000-2,000 training observations, with continued gains beyond 4,500 observations. Model performance depends strongly on sample heterogeneity: nationally-representative samples yield 20-70 percent higher accuracy than urban-only or rural-only samples.
    Date: 2026–02
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2602.02805
  5. By: Kittel, Rebecca; Silva, Bruno Castanho
    Abstract: Politicians can adjust the complexity of their communication to signal different things to different audiences: more complex language can indicate competence, while lower complexity may bring them closer to "common people". These strategic shifts in complexity, however, remain understudied. We ask what individual and contextual factors relate to politicians' use of more or less complex language in their communication, with a dataset matching 116,000 parliamentary speeches from 15 countries with 800,000 contemporaneous Facebook posts from the same MPs between 2018 and 2021, and apply measures of language complexity to each. Results show that women use more complex language in parliament, and that far-right politicians, while similar to others in parliamentary speech, simplify their language the most on social media, and benefit the most from higher engagement with their simpler posts. These results show new dimensions of how politicians strategically adapt their communication styles to the audience.
    Keywords: Language Complexity, Strategic Communication, Parliamentary Discourse, Social Media, Political Communication
    Date: 2026
    URL: https://d.repec.org/n?u=RePEc:zbw:wzbccs:336795

This nep-big issue is ©2026 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at https://nep.repec.org. For comments please write to the director of NEP, Marco Novarese at <director@nep.repec.org>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.