nep-big 2026-02-02 papers

on Big Data

Issue of 2026–02–02
eight papers chosen by
Tom Coupé, University of Canterbury

Using Natural Language Processing to Identify Sentiment of Green Investors By Janda, Karel; Rozsahegyi, Marketa; Quang Van Tran; Zhang, Binyi
Can Large Language Models Improve Venture Capital Exit Timing After IPO? By Mohammadhossien Rashidi
Pattern Recognition of Ozone-Depleting Substance Exports in Global Trade Data By Muhammad Sukri Bin Ramli
Web Reviews as a New Leading Indicator for Nowcasting Travel Expenditure in Balance of Payments Statistics By Oxana Babecka Kucharcukova; Jan Bruha; Petr Sterba
Large Language Models Polarize Ideologically but Moderate Affectively in Online Political Discourse By Gavin Wang; Srinaath Anbudurai; Oliver Sun; Xitong Li; Lynn Wu
Bank Runs With and Without Bank Failure By Sergio Correia; Stephan Luck; Emil Verner
Measuring Efficiency and Equity Framing in Economics Research: LLM-Based Evidence from 1950 to 2021 By Sebastian Galiani; Ramiro H. Gálvez; Franco Mettola La Giglia; Raul A. Sosa
Artificial Intelligence and Skills: Evidence from Contrastive Learning in Online Job Vacancies By Hangyu Chen; Yongming Sun; Yiming Yuan

Using Natural Language Processing to Identify Sentiment of Green Investors

By:	Janda, Karel; Rozsahegyi, Marketa; Quang Van Tran; Zhang, Binyi
Abstract:	This paper investigates the role of investor sentiment in the pricing and volatility dynamics of green bond exchange-traded funds (ETFs). The paper combines verbal description with a literature review, and it does not engage in actual data-based research analysis. While the literature on sentiment finance and ESG investing has expanded rapidly, empirical evidence focusing on fixed-income ESG instruments remains limited. We address this gap by employing modern natural language processing (NLP) techniques to construct sentiment indicators derived from news coverage and sustainability-related textual information. These indicators may be used to examine their impact on returns and volatility of selected green bond ETFs. By combining behavioural finance insights with state-of-the-art NLP methods, the paper contributes to sustainable finance research and highlights the informational role of textual data in green financial markets.
Keywords:	NLP model, ESG, Exchange Traded Funds
JEL:	C45 C55 G11 G17
Date:	2025
URL:	https://d.repec.org/n?u=RePEc:zbw:esprep:335572

Can Large Language Models Improve Venture Capital Exit Timing After IPO?

By:	Mohammadhossien Rashidi
Abstract:	Exit timing after an IPO is one of the most consequential decisions for venture capital (VC) investors, yet existing research focuses mainly on describing when VCs exit rather than evaluating whether those choices are economically optimal. Meanwhile, large language models (LLMs) have shown promise in synthesizing complex financial data and textual information but have not been applied to post-IPO exit decisions. This study introduces a framework that uses LLMs to estimate the optimal time for VC exit by analyzing monthly post-IPO information (financial performance, filings, news, and market signals) and recommending whether to sell or continue holding. We compare these LLM-generated recommendations with the actual exit dates observed for VCs and compute the return differences between the two strategies. By quantifying gains or losses associated with following the LLM, this study provides evidence on whether AI-driven guidance can improve exit timing and complements traditional hazard and real-options models in venture capital research.
Date:	2025–12
URL:	https://d.repec.org/n?u=RePEc:arx:papers:2601.00810
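
The comparison described in this abstract (return from following the LLM's sell signal versus return at the observed exit date) can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's implementation: the `ExitCase` structure, the "hold"/"sell" signal encoding, the use of simple returns relative to the IPO price, and all numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ExitCase:
    """One VC-backed IPO (hypothetical structure, illustrative only)."""
    ipo_price: float
    monthly_prices: List[float]  # price at the end of month 1, 2, ... after IPO
    actual_exit_month: int       # 1-based month of the observed VC exit
    llm_signals: List[str]       # one "hold" or "sell" recommendation per month


def llm_exit_month(signals: List[str]) -> int:
    """Return the first month with a "sell" signal, or the last month if none."""
    for month, signal in enumerate(signals, start=1):
        if signal == "sell":
            return month
    return len(signals)


def simple_return(case: ExitCase, month: int) -> float:
    """Simple return from the IPO price to the price at the given month."""
    return case.monthly_prices[month - 1] / case.ipo_price - 1.0


def return_difference(case: ExitCase) -> float:
    """LLM-strategy return minus actual-exit return (positive favors the LLM)."""
    llm_return = simple_return(case, llm_exit_month(case.llm_signals))
    actual_return = simple_return(case, case.actual_exit_month)
    return llm_return - actual_return


# Made-up example: the LLM says "sell" in month 2, the VC actually exited in month 4.
case = ExitCase(
    ipo_price=20.0,
    monthly_prices=[22.0, 25.0, 18.0, 16.0],
    actual_exit_month=4,
    llm_signals=["hold", "sell", "hold", "hold"],
)
diff = return_difference(case)
print(f"Return difference (LLM minus actual): {diff:.2f}")
```

A positive difference means the hypothetical LLM-timed exit would have outperformed the observed exit for that case; averaging over many cases would give the kind of gain-or-loss estimate the abstract refers to.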

Pattern Recognition of Ozone-Depleting Substance Exports in Global Trade Data

By:	Muhammad Sukri Bin Ramli
Abstract:	New methods are needed to monitor environmental treaties, like the Montreal Protocol, by reviewing large, complex customs datasets. This paper introduces a framework using unsupervised machine learning to systematically detect suspicious trade patterns and highlight activities for review. Our methodology, applied to 100,000 trade records, combines several ML techniques. Unsupervised Clustering (K-Means) discovers natural trade archetypes based on shipment value and weight. Anomaly Detection (Isolation Forest and IQR) identifies rare "mega-trades" and shipments with commercially unusual price-per-kilogram values. This is supplemented by Heuristic Flagging to find tactics like vague shipment descriptions. These layers are combined into a priority score, which successfully identified 1,351 price outliers and 1,288 high-priority shipments for customs review. A key finding is that high-priority commodities show a different and more valuable value-to-weight ratio than general goods. This was validated using Explainable AI (SHAP), which confirmed vague descriptions and high value as the most significant risk predictors. The model's sensitivity was validated by its detection of a massive spike in "mega-trades" in early 2021, correlating directly with the real-world regulatory impact of the US AIM Act. This work presents a repeatable unsupervised learning pipeline to turn raw trade data into prioritized, usable intelligence for regulatory groups.
Date:	2025–11
URL:	https://d.repec.org/n?u=RePEc:arx:papers:2512.07864
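
Two of the layers named in this abstract, IQR-based price-per-kilogram outlier detection and heuristic flagging of vague descriptions, can be illustrated with a small standard-library sketch. It is not the author's pipeline: the K-Means, Isolation Forest, and SHAP steps are omitted, and the field names, the `VAGUE_TERMS` list, the 1.5 x IQR rule, the equal-weight priority score, and the sample records are all assumptions made for illustration.

```python
import statistics

# Hypothetical list of descriptions treated as "vague" (an assumption).
VAGUE_TERMS = {"chemicals", "gas", "goods", "other"}


def iqr_bounds(values, k=1.5):
    """Return (low, high) fences at k * IQR beyond the first and third quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def score_shipments(shipments):
    """Attach price-per-kg, two boolean flags, and a simple priority score."""
    prices = [s["value_usd"] / s["weight_kg"] for s in shipments]
    low, high = iqr_bounds(prices)
    scored = []
    for shipment, price in zip(shipments, prices):
        price_outlier = price < low or price > high
        vague = shipment["description"].strip().lower() in VAGUE_TERMS
        scored.append({
            **shipment,
            "price_per_kg": price,
            "price_outlier": price_outlier,
            "vague": vague,
            # Equal-weight sum of flags; the paper's actual weighting is not given.
            "priority": int(price_outlier) + int(vague),
        })
    return scored


# Made-up records for illustration only.
shipments = [
    {"description": "HCFC-22 refrigerant", "value_usd": 1000, "weight_kg": 100},
    {"description": "HFC-134a cylinders", "value_usd": 1100, "weight_kg": 100},
    {"description": "R-410A refrigerant", "value_usd": 1200, "weight_kg": 100},
    {"description": "HFC-32 drums", "value_usd": 1300, "weight_kg": 100},
    {"description": "chemicals", "value_usd": 9000, "weight_kg": 100},
]
scored = score_shipments(shipments)
for row in sorted(scored, key=lambda r: r["priority"], reverse=True):
    print(row["description"], row["price_per_kg"], row["priority"])
```

In this toy data only the last record is both a price outlier and vaguely described, so it ranks first for review.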

Web Reviews as a New Leading Indicator for Nowcasting Travel Expenditure in Balance of Payments Statistics

By:	Oxana Babecka Kucharcukova; Jan Bruha; Petr Sterba
Abstract:	This paper introduces a novel travel performance indicator derived from tourist reviews available online, utilizing text mining techniques. The time series generated is integrated as an explanatory variable into a small-scale empirical model of travel revenue and expenditure in the Czech Republic's balance of payments. The significance of online reviews for nowcasting is validated through various machine learning algorithms. The study also addresses empirical challenges, including trends in review data, the impact of the COVID-19 pandemic, and occasional methodological changes in official statistical series, and outlines strategies to overcome these obstacles. The findings suggest that the proposed model is a valuable addition to the Czech National Bank's nowcasting framework. To the best of our knowledge, this is the first study to combine text analysis with nowcasting of a BoP item, specifically travel services.
Keywords:	Balance of payments, text mining, travel services
JEL:	C53 C83 F17
Date:	2025–11
URL:	https://d.repec.org/n?u=RePEc:cnb:wpaper:2025/13

Large Language Models Polarize Ideologically but Moderate Affectively in Online Political Discourse

By:	Gavin Wang; Srinaath Anbudurai; Oliver Sun; Xitong Li; Lynn Wu
Abstract:	The emergence of large language models (LLMs) is reshaping how people engage in political discourse online. We examine how the release of ChatGPT altered ideological and emotional patterns in the largest political forum on Reddit. Analysis of millions of comments shows that ChatGPT intensified ideological polarization: liberals became more liberal, and conservatives more conservative. This shift does not stem from the creation of more persuasive or ideologically extreme original content using ChatGPT. Instead, it originates from the tendency of ChatGPT-generated comments to echo and reinforce the viewpoint of original posts, a pattern consistent with algorithmic sycophancy. Yet, despite growing ideological divides, affective polarization, measured by hostility and toxicity, declined. These findings reveal that LLMs can simultaneously deepen ideological separation and foster more civil exchanges, challenging the long-standing assumption that extremity and incivility necessarily move together.
Date:	2026–01
URL:	https://d.repec.org/n?u=RePEc:arx:papers:2601.20238

Bank Runs With and Without Bank Failure

By:	Sergio Correia; Stephan Luck; Emil Verner
Abstract:	We study the causes and consequences of bank runs using a novel dataset on bank runs in the United States from 1863 to 1934. Applying natural language processing to historical newspapers, we identify 4,049 runs on individual banks. Runs are considerably more likely in weak banks but also occur in strong banks, especially in response to negative news about the real economy or the broader banking system. However, runs typically only result in failure for banks with weak fundamentals. Strong banks survive runs through various mechanisms, including interbank cooperation, equity injections, public signals of strength, and suspension of convertibility. At the local level, bank failures (with and without runs) translate into substantially larger declines in deposits and lending than runs without failures. Our findings suggest that poor bank fundamentals are necessary for bank runs to translate into failure and for bank distress to generate severe economic consequences.
Date:	2026–01
URL:	https://d.repec.org/n?u=RePEc:arx:papers:2601.20285

Measuring Efficiency and Equity Framing in Economics Research: LLM-Based Evidence from 1950 to 2021

By:	Sebastian Galiani; Ramiro H. Gálvez; Franco Mettola La Giglia; Raul A. Sosa
Abstract:	We measure how frontier research frames what is normatively at stake along the efficiency and equity dimension. We develop and validate an LLM-based measurement pipeline and apply it to 27,464 full-text journal articles from 1950 to 2021. Efficiency-focused framing rises through the late 1980s, then declines as equity-related framing expands after 1990, especially in applied work and policy evaluations. By 2021, papers with an equity component are about as common as papers framed purely around efficiency. President transmittal letters in the Economic Report of the President show a similar post-1990 shift toward equity, providing an external benchmark.
JEL:	A14 B2 C8
Date:	2026–01
URL:	https://d.repec.org/n?u=RePEc:nbr:nberwo:34714

Artificial Intelligence and Skills: Evidence from Contrastive Learning in Online Job Vacancies

By:	Hangyu Chen; Yongming Sun; Yiming Yuan
Abstract:	We investigate the impact of artificial intelligence (AI) adoption on skill requirements using 14 million online job vacancies from Chinese listed firms (2018-2022). Employing a novel Extreme Multi-Label Classification (XMLC) algorithm trained via contrastive learning and LLM-driven data augmentation, we map vacancy requirements to the ESCO framework. By benchmarking occupation-skill relationships against 2018 O*NET-ESCO mappings, we document a robust causal relationship between AI adoption and the expansion of skill portfolios. Our analysis identifies two distinct mechanisms. First, AI reduces information asymmetry in the labor market, enabling firms to specify current occupation-specific requirements with greater precision. Second, AI empowers firms to anticipate evolving labor market dynamics. We find that AI adoption significantly increases the demand for "forward-looking" skills--those absent from 2018 standards but subsequently codified in 2022 updates. This suggests that AI allows firms to lead, rather than follow, the formal evolution of occupational standards. Our findings highlight AI's dual role as both a stabilizer of current recruitment information and a catalyst for proactive adaptation to future skill shifts.
Date:	2026–01
URL:	https://d.repec.org/n?u=RePEc:arx:papers:2601.03558
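
The abstract's definition of "forward-looking" skills (absent from the 2018 standard but codified in the 2022 update) amounts to a set operation, sketched below. This is only an illustration of that definition, not the authors' XMLC pipeline; the function name and the skill labels are hypothetical.

```python
def forward_looking_skills(vacancy_skills, standard_2018, standard_2022):
    """Skills a vacancy requests that were not in the 2018 standard
    for the occupation but appear in the 2022 update."""
    return (set(vacancy_skills) - set(standard_2018)) & set(standard_2022)


# Made-up skill labels for a single occupation.
standard_2018 = {"sql", "reporting", "spreadsheets"}
standard_2022 = {"sql", "reporting", "spreadsheets", "machine learning", "data ethics"}
vacancy_skills = {"sql", "machine learning", "prompt design"}

result = forward_looking_skills(vacancy_skills, standard_2018, standard_2022)
print(sorted(result))
```

Here "machine learning" counts as forward-looking, while "prompt design" does not because it is missing from the hypothetical 2022 update as well.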

This nep-big issue is ©2026 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.

General information on the NEP project can be found at https://nep.repec.org. For comments please write to the director of NEP, Marco Novarese at <director@nep.repec.org>. Put “NEP” in the subject, otherwise your mail may be rejected.

NEP’s infrastructure is sponsored by the Griffith Business School of Griffith University in Australia.