nep-big New Economics Papers
on Big Data
Issue of 2025–12–08
eleven papers chosen by
Tom Coupé, University of Canterbury


  1. Evolution of Spatial Drivers for Oil Palm Expansion over Time: Insights from Spatiotemporal Data and Machine Learning Models By Zhao, Jing; Cochrane, Mark; Zhang, Xin; Elmore, Andrew; Lee, Janice; Su, Ye
  2. Misaligned by Design: Incentive Failures in Machine Learning By David Autor; Andrew Caplin; Daniel J. Martin; Philip Marx
  3. Satellite Data in Agricultural and Environmental Economics: Theory and Practice By Wüpper, David; Oluoch, Wyclife Agumba; Hadi
  4. Tracking Trends in Topics of Agricultural and Applied Economics Discourse over the Last Century Using Natural Language Processing By Lee, Jacob W.; Elliott, Brendan; Lam, Aaron; Gupta, Neha; Wilson, Norbert L.W.; Collins, Leslie M.; Mainsah, Boyla
  5. Underlying inflation measures for Germany By Ciftci, Muhsin; Wieland, Elisabeth
  6. Socioeconomic Inequality in Longevity: A Multidimensional Approach By Paul Bingley; Claus Thustrup Kreiner; Benjamin Ly Serena
  7. Engagement vs. Commitment: The Economic Trade-Offs of Polarizing News Content By Yan, Shunyao; Miller, Klaus M.
  8. What Do LLMs Want? By Thomas R. Cook; Sophia Kazinnik; Zach Modig; Nathan M. Palmer
  9. From Tweets to Returns: Validating LLM-Based Sentiment Signals in Energy Stocks By Sarra Ben Yahia; Jose Angel Garcia Sanchez; Rania Hentati Kaffel
  10. Text Sentiment About Monetary Policy By Hie Joo Ahn; Thomas R. Cook; Taeyoung Doh; Elias Kastritis; Jesse Wedewer
  11. Inflation narratives and expectations By Trebbi, Giovanni

  1. By: Zhao, Jing; Cochrane, Mark; Zhang, Xin; Elmore, Andrew; Lee, Janice; Su, Ye
    Keywords: Land Economics/Use, Environmental Economics and Policy, Community/Rural/Urban Development
    Date: 2024
    URL: https://d.repec.org/n?u=RePEc:ags:aaea24:344016
  2. By: David Autor; Andrew Caplin; Daniel J. Martin; Philip Marx
    Abstract: The cost of error in many high-stakes settings is asymmetric: misdiagnosing pneumonia when absent is an inconvenience, but failing to detect it when present can be life-threatening. Accordingly, artificial intelligence (AI) models used to assist such decisions are frequently trained with asymmetric loss functions that incorporate human decision-makers' trade-offs between false positives and false negatives. In two focal applications, we show that this standard alignment practice can backfire. In both cases, it would be better to train the machine learning model with a loss function that ignores the human’s objective and then adjust predictions ex post according to that objective. We rationalize this result using an economic model of incentive design with endogenous information acquisition. The key insight from our theoretical framework is that machine classifiers perform not one but two incentivized tasks: choosing how to classify and learning how to classify. We show that while the adjustments engineers use correctly incentivize choosing, they can simultaneously reduce the incentives to learn. Our formal treatment of the problem reveals that methods embraced for their intuitive appeal can in fact misalign human and machine objectives in predictable ways.
    JEL: C1 D8
    Date: 2025–11
    URL: https://d.repec.org/n?u=RePEc:nbr:nberwo:34504
  3. By: Wüpper, David; Oluoch, Wyclife Agumba; Hadi
    Abstract: Agricultural and environmental economists are in the fortunate position that a lot of what is happening on the ground is observable from space. Most agricultural production happens in the open and one can see from space when and where innovations are adopted, crop yields change, or forests are converted to pastures, to name just a few examples. However, converting images into measurements of a particular variable is not trivial, as there are more pitfalls and nuances than “meet the eye”. Overall, however, research benefits tremendously from advances in available satellite data as well as complementary tools, such as cloud-based platforms for data processing, and machine learning algorithms to detect phenomena and map variables. The focus of this keynote is to provide agricultural and environmental economists with an accessible introduction to working with satellite data, showcase applications, discuss advantages and weaknesses of satellite data, and emphasize best practices. This is supported by extensive Supplementary Materials, explaining the technical foundations, describing in detail how to create different variables, sketching out workflows, and discussing required resources and skills. Last but not least, example data and reproducible code are available online.
    Keywords: Environmental Economics and Policy, Research Methods/Statistical Methods
    Date: 2024–07–26
    URL: https://d.repec.org/n?u=RePEc:ags:iaae24:344359
  4. By: Lee, Jacob W.; Elliott, Brendan; Lam, Aaron; Gupta, Neha; Wilson, Norbert L.W.; Collins, Leslie M.; Mainsah, Boyla
    Keywords: Research Methods/Statistical Methods, Agricultural and Food Policy, Teaching/Communication/Extension/Profession
    Date: 2024
    URL: https://d.repec.org/n?u=RePEc:ags:aaea24:343814
  5. By: Ciftci, Muhsin; Wieland, Elisabeth
    Abstract: In this paper, we evaluate a set of measures of underlying inflation for Germany using conventional measures, such as core inflation (excluding energy and food items), and alternative measures based on econometric models, machine learning, and micro-price evidence. We compare these measures through detailed in-sample and out-of-sample evaluations. The alternative measures exhibit lower volatility, minimal bias, and superior out-of-sample forecasting accuracy. While we find no evidence that any single measure clearly outperforms the others over time, the range of alternative measures also reflects a somewhat earlier uptick and downturn in light of the recent inflation surge in comparison to traditional ones. In addition, all measures under consideration are highly sensitive to monetary policy shocks.
    Keywords: Underlying inflation, monetary policy, local projections, machine learning
    JEL: E31 E37 C22
    Date: 2025
    URL: https://d.repec.org/n?u=RePEc:zbw:bubtps:333424
  6. By: Paul Bingley (The Danish Center for Social Science Research); Claus Thustrup Kreiner (Department of Economics, University of Copenhagen); Benjamin Ly Serena (The ROCKWOOL Foundation Research Unit)
    Abstract: Socioeconomic inequality in longevity is typically measured using a single socioeconomic indicator such as education or income. We combine multiple indicators—education, income, occupation, wealth, and IQ scores—and apply machine learning to measure inequality in longevity. Using Danish population-wide data spanning 40 years, we track mortality for the 1942–44 birth cohorts from age 40 onwards to estimate life expectancy by socioeconomic status. Individuals at the top of the socioeconomic distribution live nearly 25 years longer than those at the bottom. The socioeconomic gradient in life expectancy becomes 50–150% steeper when using multiple indicators.
    Keywords: Life Expectancy, Inequality, Machine Learning
    JEL: I14
    Date: 2025–12–01
    URL: https://d.repec.org/n?u=RePEc:kud:kucebi:2514
  7. By: Yan, Shunyao (Santa Clara University - Marketing); Miller, Klaus M. (HEC Paris)
    Abstract: We study how polarizing content shapes two economic outcomes on a major European news website: engagement (time on site) and commitment (paid subscriptions). Using advances in natural language processing, we construct deep-learning and large-language-model-based textual measures of polarization tailored to a multiparty system. We combine comprehensive supply and demand data (the full publisher-wide article inventory with user-level clicks and subscription outcomes) to track how consumers interact with polarizing articles. To identify causal effects, we use two theoretically distinct instruments: (i) a Bartik-style design that interacts users' stable topic preferences with weekly shifts in the supply of polarizing content; and (ii) an election shock that raises political salience for a subset of readers. We document a "polarization trap": exogenous increases in exposure to polarizing content raise engagement (time on site) but reduce the probability of subscribing. The negative subscription effect is driven more by the affective than the ideological dimension of polarization and is strongest during high-salience political periods. These results imply a strategic trade-off for publishers: content that maximizes short-run attention can undermine the formation of a loyal, paying subscriber base.
    Keywords: Polarization; Subscriptions; Online Media; News Consumption; Instrumental Variables; Natural Language Processing
    JEL: M00
    Date: 2025–10–06
    URL: https://d.repec.org/n?u=RePEc:ebg:heccah:1585
  8. By: Thomas R. Cook; Sophia Kazinnik; Zach Modig; Nathan M. Palmer
    Abstract: Large language models (LLMs) are now used for economic reasoning, but their implicit "preferences" are poorly understood. We study LLM preferences as revealed by their choices in simple allocation games and a job-search setting. Most models favor equal splits in dictator-style allocation games, consistent with inequality aversion. Structural estimates recover Fehr–Schmidt parameters that indicate inequality aversion is stronger than in similar experiments with human participants. However, we find these preferences are malleable: reframing (e.g., masking social context) and learned control vectors shift choices toward payoff-maximizing behavior, while personas move them less effectively. We then turn to a more complex economic scenario. Extending a McCall job search environment, we also recover effective discounting from accept/reject policies, but observe that model responses may not always be rationalizable, and in some cases suggest inconsistent preferences. Efforts to steer LLM responses in the McCall scenario are also less consistent. Together, our results suggest (i) LLMs exhibit latent preferences that may not perfectly align with typical human preferences and (ii) LLMs can be steered toward desired preferences, though this is more difficult with complex economic tasks.
    Keywords: large language models; Simulation modeling
    JEL: C63 C68 C61 D14 D83 D91 E20 E21
    Date: 2025–11–25
    URL: https://d.repec.org/n?u=RePEc:fip:fedkrw:102166
  9. By: Sarra Ben Yahia (CES - Centre d'économie de la Sorbonne - UP1 - Université Paris 1 Panthéon-Sorbonne - CNRS - Centre National de la Recherche Scientifique); Jose Angel Garcia Sanchez (CES - Centre d'économie de la Sorbonne - UP1 - Université Paris 1 Panthéon-Sorbonne - CNRS - Centre National de la Recherche Scientifique); Rania Hentati Kaffel (CES - Centre d'économie de la Sorbonne - UP1 - Université Paris 1 Panthéon-Sorbonne - CNRS - Centre National de la Recherche Scientifique)
    Abstract: Our research assesses the predictive value of LLM-based sentiment in forecasting energy stock returns. Using FinBERT-derived sentiment indicators from 415,193 tweets spanning 2018-2024, we find statistically significant causal relationships for 80% of companies analyzed. Our VAR analysis reveals heterogeneous optimal lag structures ranging from 2 to 14 days, providing econometric evidence against semi-strong market efficiency. Our results show that the accuracy of the forecast depends critically on the quality and coverage of the data. Our contribution is twofold: (i) a scalable LLM-driven pipeline to quantify firm-level sentiment at daily frequency, and (ii) an econometric validation via VAR/Granger that uncovers economically meaningful lead-lag patterns.
    Keywords: sentiment analysis, LLM, FinBERT, energy equity markets, Twitter/X sentiment, return forecasting, webscraping, information diffusion, information extraction, financial NLP, VAR
    Date: 2025–09–30
    URL: https://d.repec.org/n?u=RePEc:hal:cesptp:hal-05312326
  10. By: Hie Joo Ahn; Thomas R. Cook; Taeyoung Doh; Elias Kastritis; Jesse Wedewer
    Abstract: This paper uses text data from Federal Open Market Committee (FOMC) meeting transcripts to estimate the reference levels of full employment, inflation, and financial conditions perceived by voting members and to uncover time variation in the Taylor rule parameters. We construct topic dictionaries on economic slack, inflation, and financial markets, and infer reference levels from members’ sentiment using a state-space model. The estimated employment reference level indicates that FOMC voting members generally perceived the labor market as tighter than implied by the Congressional Budget Office’s estimates between the mid-1980s and early 2000s, whereas the two measures align closely during the Great Recession and its subsequent recovery. The members’ perceived inflation target varies widely in the 1970s and 1980s, trends downward in the 1990s, and stabilizes slightly below two percent thereafter. The estimated Taylor rule exhibits shifting policy weights over time—stronger emphasis on inflation stabilization before the mid-1990s, greater responsiveness to employment deviations thereafter, and renewed emphasis on the inflation trend following the Great Recession—while interest-rate smoothing remains substantial throughout.
    Keywords: Federal Open Market Committee (FOMC); Taylor rule; Federal Reserve monetary policy; sentiment
    JEL: C32 E43 E52 E58
    Date: 2025–11–25
    URL: https://d.repec.org/n?u=RePEc:fip:fedkrw:102162
  11. By: Trebbi, Giovanni
    Abstract: I study how demand-supply narrative disagreement between general and specialized newspapers can explain households’ absolute gap in inflation expectations with experts. I measure inflation narratives via a Causality Extraction algorithm that can identify causal relationships between events in a text and, hence, extract the perceived triggers of inflation. Causal relations can explain why narratives affect people’s beliefs and cannot be captured by dictionary methods, topic models, and word embeddings. I then classify inflation narratives into demand and supply narratives based on their focus on demand and supply triggers. I measure narrative disagreement between general and specialized newspapers from their attention difference on demand and supply narratives. The absolute expectation gap widens when narrative disagreement increases, especially for non-college-educated and older households. Unlike the narratives of specialized newspapers, the narratives of general newspapers incorrectly align with experts’ demand-supply views.
    Keywords: causality extraction, natural language processing, news media
    JEL: C53 D1 D8 E3
    Date: 2025–12
    URL: https://d.repec.org/n?u=RePEc:ecb:ecbwps:20253158

This nep-big issue is ©2025 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at https://nep.repec.org. For comments please write to the director of NEP, Marco Novarese at <director@nep.repec.org>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.