on Big Data |
By: | Ziqi Li; Zhan Peng |
Abstract: | Moran Eigenvector Spatial Filtering (ESF) approaches have shown promise in accounting for spatial effects in statistical models. Can this extend to machine learning? This paper examines the effectiveness of using Moran Eigenvectors as additional spatial features in machine learning models. We generate synthetic datasets with known processes involving spatially varying and nonlinear effects across two different geometries. Moran Eigenvectors calculated from different spatial weights matrices, with and without a priori eigenvector selection, are tested. We assess the performance of popular machine learning models, including Random Forests, LightGBM, XGBoost, and TabNet, and benchmark their accuracies in terms of cross-validated R² values against models that use only coordinates as features. We also extract coefficients and functions from the models using GeoShapley and compare them with the true processes. Results show that machine learning models using only location coordinates achieve better accuracies than eigenvector-based approaches across various experiments and datasets. Furthermore, we note that while these findings apply to spatial processes exhibiting positive spatial autocorrelation, they do not necessarily extend to network autocorrelation or to cases of negative spatial autocorrelation, where Moran Eigenvectors would still be useful. |
Date: | 2025–04 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2504.12450 |
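A minimal sketch (not the authors' code) of how Moran eigenvectors can be obtained from a spatial weights matrix and appended to a coordinate-based feature set; the synthetic data, the k-nearest-neighbour weights, and the Random Forest benchmark are illustrative assumptions.

```python
# Minimal sketch: Moran eigenvectors from a symmetric k-NN spatial weights
# matrix, appended as extra ML features alongside coordinates.
import numpy as np
from sklearn.neighbors import kneighbors_graph
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
coords = rng.uniform(0, 1, size=(500, 2))                  # point locations
y = np.sin(4 * coords[:, 0]) + rng.normal(0, 0.1, 500)     # toy spatial outcome

# Symmetric k-nearest-neighbour connectivity matrix C
C = kneighbors_graph(coords, n_neighbors=8, mode="connectivity").toarray()
C = (C + C.T) / 2

# Double-centre C and keep the eigenvectors with the largest (positive) eigenvalues
n = C.shape[0]
M = np.eye(n) - np.ones((n, n)) / n
eigvals, eigvecs = np.linalg.eigh(M @ C @ M)
E = eigvecs[:, np.argsort(eigvals)[::-1][:30]]             # leading Moran eigenvectors

# Benchmark: coordinates only vs. coordinates + Moran eigenvectors
rf = lambda: RandomForestRegressor(n_estimators=200, random_state=0)
print("coords only :", cross_val_score(rf(), coords, y, scoring="r2").mean())
print("coords + E  :", cross_val_score(rf(), np.hstack([coords, E]), y, scoring="r2").mean())
```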
By: | Hannes Wallimann; Noah Balthasar |
Abstract: | Children's travel behavior plays a critical role in shaping long-term mobility habits and public health outcomes. Despite growing global interest, little is known about the factors influencing children's travel mode choice for school journeys in Switzerland. This study addresses this gap by applying a random forest classifier - a machine learning algorithm - to data from the Swiss Mobility and Transport Microcensus, in order to identify key predictors of children's travel mode choice for school journeys. Distance consistently emerges as the most important predictor across all models, for instance when distinguishing between active vs. non-active travel or car vs. non-car usage. The models show relatively high performance, with overall classification accuracies of 87.27% (active vs. non-active) and 78.97% (car vs. non-car), respectively. The study offers empirically grounded insights that can support school mobility policies and demonstrates the potential of machine learning in uncovering behavioral patterns in complex transport datasets. |
Date: | 2025–04 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2504.09947 |
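A hypothetical sketch of the type of analysis described: a random forest classifier on trip-level predictors with distance among them, and feature importances used to identify the key drivers. The variables and data are invented, not taken from the Swiss Mobility and Transport Microcensus.

```python
# Hypothetical sketch: random forest for children's school travel mode choice.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(1)
n = 2000
X = pd.DataFrame({
    "distance_km": rng.gamma(2.0, 1.5, n),        # trip distance
    "age": rng.integers(6, 16, n),
    "household_cars": rng.integers(0, 3, n),
    "urban": rng.integers(0, 2, n),
})
# Toy label: longer trips are less likely to be active (walk/bike)
p_active = 1 / (1 + np.exp(X["distance_km"] - 2))
y = (rng.uniform(size=n) < p_active).astype(int)   # 1 = active travel

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)
clf = RandomForestClassifier(n_estimators=300, random_state=1).fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))
print(dict(zip(X.columns, clf.feature_importances_.round(3))))   # key predictors
```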
By: | Yu Zhang; Zelin Wu; Claudio Tessone |
Abstract: | Cryptocurrencies are digital tokens built on blockchain technology, with thousands actively traded on centralized exchanges (CEXs). Unlike stocks, which are backed by real businesses, cryptocurrencies are recognized by researchers as a distinct class of assets. How do investors treat this new category of asset in trading? Do they use it as an investment tool in the same way as stocks? We answer these questions by investigating cryptocurrency and stock price time series, which can reflect investors' attitudes towards the respective assets. Concretely, we use different machine learning models to classify cryptocurrency and stock price time series over the same period and obtain an extremely high accuracy, which indicates that cryptocurrency investors behave differently in trading from stock investors. We then extract features from these price time series to explain the difference in price patterns, including the mean, variance, maximum, minimum, kurtosis, skewness, and first- to third-order autocorrelations, and use machine learning methods such as logistic regression (LR), random forest (RF), and support vector machine (SVM) for classification. The classification results show that these extracted features help explain the difference in price time series patterns between cryptocurrencies and stocks. |
Date: | 2025–04 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2504.12771 |
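A rough sketch of the feature-and-classify pipeline the abstract describes, applied to synthetic price series: summary statistics and autocorrelations of log returns are extracted and a logistic regression separates the two asset classes.

```python
# Hypothetical sketch: classify price series as cryptocurrency vs. stock
# from summary-statistic features; the data below are synthetic.
import numpy as np
from scipy.stats import kurtosis, skew
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

def acf(x, lag):
    x = x - x.mean()
    return np.dot(x[:-lag], x[lag:]) / np.dot(x, x)

def features(prices):
    r = np.diff(np.log(prices))                      # log returns
    return [r.mean(), r.var(), r.max(), r.min(), kurtosis(r), skew(r),
            acf(r, 1), acf(r, 2), acf(r, 3)]

rng = np.random.default_rng(2)
# Toy series: the "crypto" group has fatter-tailed, more volatile returns
crypto = [100 * np.exp(np.cumsum(rng.standard_t(3, 500) * 0.03)) for _ in range(50)]
stocks = [100 * np.exp(np.cumsum(rng.normal(0, 0.01, 500))) for _ in range(50)]

X = np.array([features(p) for p in crypto + stocks])
y = np.array([1] * 50 + [0] * 50)                    # 1 = cryptocurrency, 0 = stock
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
print("cross-validated accuracy:", cross_val_score(clf, X, y, cv=5).mean())
```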
By: | Kasymkhan Khubiev; Mikhail Semenov |
Abstract: | Algorithmic trading relies on extracting meaningful signals from diverse financial data sources, including candlestick charts, order statistics on placed and canceled orders, traded volume data, limit order books, and news flow. While deep learning has demonstrated remarkable success in processing unstructured data and has significantly advanced natural language processing, its application to structured financial data remains an ongoing challenge. This study investigates the integration of deep learning models with financial data modalities, aiming to enhance predictive performance in trading strategies and portfolio optimization. We present a novel approach to incorporating limit order book analysis into algorithmic trading by developing embedding techniques and treating sequential limit order book snapshots as distinct input channels in an image-based representation. Our methodology for processing limit order book data achieves state-of-the-art performance in high-frequency trading algorithms, underscoring the effectiveness of deep learning in financial applications. |
Date: | 2025–04 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2504.13521 |
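A minimal sketch of the representation hinted at above: a rolling window of limit order book snapshots stacked as the channels of an image-like tensor and passed through a small convolutional network. The tensor layout and the network are illustrative assumptions, not the authors' architecture.

```python
# Hypothetical sketch: sequential LOB snapshots as channels of an image-like input.
import torch
import torch.nn as nn

T, L = 16, 10                        # snapshots per window, price levels per side
# Each snapshot: L levels x 4 columns (bid price, bid size, ask price, ask size)
window = torch.randn(32, T, L, 4)    # batch of 32 windows; channels = snapshots

model = nn.Sequential(
    nn.Conv2d(in_channels=T, out_channels=32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(32, 3),                # e.g. down / flat / up mid-price move
)
logits = model(window)               # shape: (32, 3)
```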
By: | Anton Yang; Jianwei Ai; Costas Arkolakis |
Abstract: | We introduce a new methodology to detect and measure economic activity using geospatial data and apply it to steel production, a major industrial pollution source worldwide. Combining plant output data with geospatial data, such as ambient air pollutants, nighttime lights, and temperature, we train machine learning models to predict plant locations and output. We identify about 40% (70%) of plants missing from the training sample within a 1 km (5 km) radius and achieve R² above 0.8 for output prediction at a 1 km grid and at the plant level, as well as for both regional and time series validations. Our approach can be adapted to other industries and regions, and used by policymakers and researchers to track and measure industrial activity in near real time. |
JEL: | Q50 Q53 R12 |
Date: | 2025–03 |
URL: | https://d.repec.org/n?u=RePEc:nbr:nberwo:33619 |
By: | Koukorinis, Andreas; Peters, Gareth W.; Germano, Guido |
Abstract: | We combine a hidden Markov model (HMM) and a kernel machine (SVM/MKL) into a hybrid HMM-SVM/MKL generative-discriminative learning approach to accurately classify high-frequency financial regimes and predict the direction of trades. We capture temporal dependencies and key stylized facts in high-frequency financial time series by integrating the HMM to produce model-based generative feature embeddings from microstructure time series data. These generative embeddings then serve as inputs to an SVM with single- and multi-kernel (MKL) formulations for predictive discrimination. Our methodology, which does not require manual feature engineering, improves classification accuracy compared to single-kernel SVMs and kernel target alignment methods. It also outperforms both logistic classifiers and feed-forward networks. This hybrid HMM-SVM-MKL approach delivers improvements in high-frequency time-series classification that can significantly benefit applications in finance. |
Keywords: | Fisher information kernel; hidden Markov model; Kernel methods; support vector machine |
JEL: | C1 F3 G3 |
Date: | 2025–06 |
URL: | https://d.repec.org/n?u=RePEc:ehl:lserod:128016 |
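A simplified sketch of the generative-discriminative pipeline on synthetic data, using average posterior state occupancies from a fitted HMM as embeddings rather than the full Fisher-kernel scores used in the paper; hmmlearn and scikit-learn stand in for the authors' implementation.

```python
# Simplified sketch (not the full Fisher kernel): fit an HMM to return windows,
# use per-window posterior state occupancies as generative embeddings, classify with an SVM.
import numpy as np
from hmmlearn.hmm import GaussianHMM
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
# Synthetic return windows in two volatility regimes, with toy direction labels
windows = [rng.normal(0, 0.01 * (1 + (i % 2)), size=(100, 1)) for i in range(200)]
labels = np.array([i % 2 for i in range(200)])

hmm = GaussianHMM(n_components=3, covariance_type="diag", n_iter=50, random_state=3)
hmm.fit(np.vstack(windows), lengths=[len(w) for w in windows])

# Embedding: average posterior state occupancy per window (a crude generative feature)
X = np.array([hmm.predict_proba(w).mean(axis=0) for w in windows])
print("cross-validated accuracy:", cross_val_score(SVC(kernel="rbf"), X, labels, cv=5).mean())
```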
By: | Zongxiao Wu; Yizhe Dong; Yaoyiran Li; Baofeng Shi |
Abstract: | This study explores the integration of a representative large language model, ChatGPT, into lending decision-making with a focus on credit default prediction. Specifically, we use ChatGPT to analyse and interpret loan assessments written by loan officers and generate refined versions of these texts. Our comparative analysis reveals significant differences between generative artificial intelligence (AI)-refined and human-written texts in terms of text length, semantic similarity, and linguistic representations. Using deep learning techniques, we show that incorporating unstructured text data, particularly ChatGPT-refined texts, alongside conventional structured data significantly enhances credit default predictions. Furthermore, we demonstrate how the contents of both human-written and ChatGPT-refined assessments contribute to the models' prediction and show that the effect of essential words is highly context-dependent. Moreover, we find that ChatGPT's analysis of borrower delinquency contributes the most to improving predictive accuracy. We also evaluate the business impact of the models based on human-written and ChatGPT-refined texts, and find that, in most cases, the latter yields higher profitability than the former. This study provides valuable insights into the transformative potential of generative AI in financial services. |
Date: | 2025–03 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2503.18029 |
By: | Yu Jeffrey Hu; Jeroen Rombouts; Ines Wilms |
Abstract: | Machine learning models are widely recognized for their strong performance in forecasting. To keep that performance in streaming data settings, they have to be monitored and frequently re-trained. This can be done with machine learning operations (MLOps) techniques under the supervision of an MLOps engineer. However, in digital platform settings where the number of data streams is typically large and unstable, standard monitoring becomes either suboptimal or too labor-intensive for the MLOps engineer. As a consequence, companies often fall back on very simple, worse-performing ML models without monitoring. We solve this problem by adopting a design science approach and introducing a new monitoring framework, the Machine Learning Monitoring Agent (MLMA), that is designed to work at scale for any ML model with reasonable labor cost. A key feature of our framework is test-based automated re-training based on a data-adaptive reference loss batch. The MLOps engineer is kept in the loop via key metrics and also acts, proactively or retrospectively, to maintain the ML model's performance in the production stage. We conduct a large-scale test at a last-mile delivery platform to empirically validate our monitoring framework. |
Date: | 2025–04 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2504.16789 |
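A toy sketch of the test-based re-training idea: compare a recent batch of production losses against a reference loss batch with a two-sample test and trigger re-training when degradation is significant. The test, threshold, and data are illustrative; the paper's data-adaptive construction of the reference batch is not reproduced.

```python
# Toy sketch of test-based automated re-training for a monitored ML model.
import numpy as np
from scipy.stats import mannwhitneyu

def should_retrain(reference_losses, recent_losses, alpha=0.01):
    """Trigger re-training if recent losses are significantly larger than the reference."""
    stat, p_value = mannwhitneyu(recent_losses, reference_losses, alternative="greater")
    return p_value < alpha

rng = np.random.default_rng(4)
reference = rng.gamma(2.0, 1.0, 500)            # losses collected just after deployment
recent_ok = rng.gamma(2.0, 1.0, 100)            # no drift
recent_drift = rng.gamma(2.0, 1.6, 100)         # degraded performance

print(should_retrain(reference, recent_ok))     # expected: False
print(should_retrain(reference, recent_drift))  # expected: True
```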
By: | Martina Halousková; Štefan Lyócsa |
Abstract: | Macroeconomic variables are known to significantly impact equity markets, but their predictive power for price fluctuations has been underexplored due to challenges such as the infrequency and variable timing of announcements, changing market expectations, and the gradual pricing-in of news. To address these concerns, we estimate the public's attention and sentiment towards ten scheduled macroeconomic variables using social media, news articles, information consumption data, and a search engine. We use standard and machine-learning methods and show that we are able to improve volatility forecasts for almost all 404 major U.S. stocks in our sample. Models that use sentiment towards macroeconomic announcements consistently improve volatility forecasts across all economic sectors, with the greatest improvement of 14.99% on average over the benchmark method on days of extreme price variation. The magnitude of the improvements varies with the data source used to estimate attention and sentiment, and the largest gains are found within machine-learning models. |
Date: | 2025–03 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2503.19767 |
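A loose, hypothetical sketch of a volatility forecast augmented with an announcement-day sentiment feature, using HAR-style realized-variance lags and a random forest on synthetic data; the paper's attention and sentiment measures and model specifications are not reproduced here.

```python
# Hypothetical sketch: volatility forecasts with a macro-announcement sentiment feature.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(5)
n = 1500
rv = pd.Series(np.abs(rng.normal(0, 1, n))).rolling(5, min_periods=1).mean()  # toy realized variance
sentiment = rng.normal(0, 1, n)                   # attention/sentiment proxy
announcement = rng.integers(0, 2, n)              # 1 = scheduled macro release day

X = pd.DataFrame({
    "rv_lag1": rv.shift(1),                       # HAR-style daily lag
    "rv_lag5": rv.shift(1).rolling(5).mean(),     # weekly component
    "rv_lag22": rv.shift(1).rolling(22).mean(),   # monthly component
    "sentiment": sentiment,
    "announcement": announcement,
}).dropna()
y = rv.loc[X.index]

split = int(len(X) * 0.8)
model = RandomForestRegressor(n_estimators=300, random_state=5)
model.fit(X.iloc[:split], y.iloc[:split])
print("out-of-sample R2:", r2_score(y.iloc[split:], model.predict(X.iloc[split:])))
```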
By: | Achim Ahrens; Victor Chernozhukov; Christian Hansen; Damian Kozbur; Mark Schaffer; Thomas Wiemann |
Abstract: | This paper provides a practical introduction to Double/Debiased Machine Learning (DML). DML provides a general approach to performing inference about a target parameter in the presence of nuisance parameters. The aim of DML is to reduce the impact of nuisance parameter estimation on estimators of the parameter of interest. We describe DML and its two essential components: Neyman orthogonality and cross-fitting. We highlight that DML reduces functional form dependence and accommodates the use of complex data types, such as text data. We illustrate its application through three empirical examples that demonstrate DML's applicability in cross-sectional and panel settings. |
Date: | 2025–04 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2504.08324 |
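A compact sketch of the two components named in the abstract, in the canonical partially linear model: cross-fitting of the nuisance regressions with a machine learner, followed by the Neyman-orthogonal (partialling-out) moment for the target parameter. This is the textbook construction on simulated data, not code from the paper.

```python
# Sketch of DML for a partially linear model: Y = theta*D + g(X) + e, D = m(X) + v.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(6)
n, p, theta = 2000, 10, 0.5
X = rng.normal(size=(n, p))
D = np.sin(X[:, 0]) + rng.normal(0, 1, n)                 # treatment
Y = theta * D + np.cos(X[:, 1]) + rng.normal(0, 1, n)     # outcome

res_Y, res_D = np.zeros(n), np.zeros(n)
for train, test in KFold(n_splits=5, shuffle=True, random_state=6).split(X):
    # Cross-fitting: nuisance functions fit on the training folds only
    mY = RandomForestRegressor(n_estimators=200, random_state=6).fit(X[train], Y[train])
    mD = RandomForestRegressor(n_estimators=200, random_state=6).fit(X[train], D[train])
    res_Y[test] = Y[test] - mY.predict(X[test])           # Y residualised on X
    res_D[test] = D[test] - mD.predict(X[test])           # D residualised on X

# Neyman-orthogonal (partialling-out) estimator of theta and its standard error
theta_hat = np.sum(res_D * res_Y) / np.sum(res_D ** 2)
psi = (res_Y - theta_hat * res_D) * res_D
se = np.sqrt(np.mean(psi ** 2) / np.mean(res_D ** 2) ** 2 / n)
print(f"theta_hat = {theta_hat:.3f} +/- {1.96 * se:.3f}")
```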
By: | Timothée Fabre; Damien Challet |
Abstract: | This paper investigates real-time detection of spoofing activity in limit order books, focusing on cryptocurrency centralized exchanges. We first introduce novel order flow variables based on multi-scale Hawkes processes that account both for the size and placement distance from current best prices of new limit orders. Using a Level-3 data set, we train a neural network model to predict the conditional probability distribution of mid price movements based on these features. Our empirical analysis highlights the critical role of the posting distance of limit orders in the price formation process, showing that spoofing detection models that do not take the posting distance into account are inadequate to describe the data. Next, we propose a spoofing detection framework based on the probabilistic market manipulation gain of a spoofing agent and use the previously trained neural network to compute the expected gain. Running this algorithm on all submitted limit orders in the period 2024-12-04 to 2024-12-07, we find that 31% of large orders could spoof the market. Because of its simple neural architecture, our model can be run in real time. This work contributes to enhancing market integrity by providing a robust tool for monitoring and mitigating spoofing in both cryptocurrency exchanges and traditional financial markets. |
Date: | 2025–04 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2504.15908 |
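A toy sketch of the final step described above: given a model's predicted distribution over mid-price moves with and without a candidate (spoof) order, compute that order's expected manipulation gain. The gain definition and the probabilities are illustrative placeholders, not the paper's framework.

```python
# Toy sketch: expected manipulation gain of a candidate spoof order from a
# predicted distribution over mid-price moves. The gain definition is illustrative.
import numpy as np

tick = 0.5
moves = np.array([-2, -1, 0, 1, 2]) * tick              # possible mid-price moves

def expected_gain(p_with_order, p_baseline, position=1.0):
    """Gain to a trader long `position` if a large (spoof) bid shifts the
    predicted mid-price distribution upward relative to the baseline."""
    return position * (np.dot(p_with_order, moves) - np.dot(p_baseline, moves))

p_baseline   = np.array([0.10, 0.25, 0.30, 0.25, 0.10])  # model output, no spoof order
p_with_order = np.array([0.05, 0.15, 0.30, 0.32, 0.18])  # model output, spoof bid posted

gain = expected_gain(p_with_order, p_baseline)
print("expected gain per unit position:", gain)          # positive => spoofing could pay
```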
By: | Julian Junyan Wang; Victor Xiaoqi Wang |
Abstract: | This study provides the first comprehensive assessment of consistency and reproducibility in Large Language Model (LLM) outputs in finance and accounting research. We evaluate how consistently LLMs produce outputs given identical inputs through extensive experimentation with 50 independent runs across five common tasks: classification, sentiment analysis, summarization, text generation, and prediction. Using three OpenAI models (GPT-3.5-turbo, GPT-4o-mini, and GPT-4o), we generate over 3.4 million outputs from diverse financial source texts and data, covering MD&As, FOMC statements, finance news articles, earnings call transcripts, and financial statements. Our findings reveal substantial but task-dependent consistency, with binary classification and sentiment analysis achieving near-perfect reproducibility, while complex tasks show greater variability. More advanced models do not necessarily exhibit greater consistency and reproducibility, with task-specific patterns emerging. LLMs significantly outperform expert human annotators in consistency and maintain high agreement even where human experts significantly disagree. We further find that simple aggregation strategies across 3-5 runs dramatically improve consistency. Simulation analysis reveals that despite measurable inconsistency in LLM outputs, downstream statistical inferences remain remarkably robust. These findings address concerns about what we term "G-hacking," the selective reporting of favorable outcomes from multiple Generative AI runs, by demonstrating that such risks are relatively low for finance and accounting tasks. |
Date: | 2025–03 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2503.16974 |
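A small sketch of the aggregation strategy reported to improve consistency: repeat the same query several times and keep the modal label. The single-call function is a placeholder for whichever LLM API is used.

```python
# Sketch: majority-vote aggregation over repeated LLM runs for a classification task.
from collections import Counter

def classify_once(text: str) -> str:
    """Placeholder for a single LLM call returning a label (e.g. 'positive'/'negative')."""
    raise NotImplementedError("call your LLM API here")

def classify_aggregated(text: str, n_runs: int = 5) -> str:
    """Run the same prompt n_runs times and return the modal label across runs."""
    labels = [classify_once(text) for _ in range(n_runs)]
    return Counter(labels).most_common(1)[0][0]
```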
By: | Chris Hays; Manish Raghavan |
Abstract: | Researchers and practitioners often wish to measure treatment effects in settings where units interact via markets and recommendation systems. In these settings, units are affected by certain shared states, like prices, algorithmic recommendations or social signals. We formalize this structure, calling it shared-state interference, and argue that our formulation captures many relevant applied settings. Our key modeling assumption is that individuals' potential outcomes are independent conditional on the shared state. We then prove an extension of a double machine learning (DML) theorem providing conditions for achieving efficient inference under shared-state interference. We also instantiate our general theorem in several models of interest where it is possible to efficiently estimate the average direct effect (ADE) or global average treatment effect (GATE). |
Date: | 2025–04 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2504.08836 |
By: | Tianshi Mu; Pranjal Rawat; John Rust; Chengjun Zhang; Qixuan Zhong |
Abstract: | We compare the performance of human and artificially intelligent (AI) decision makers in simple binary classification tasks where the optimal decision rule is given by Bayes Rule. We reanalyze choices of human subjects gathered from laboratory experiments conducted by El-Gamal and Grether and by Holt and Smith. We confirm that while, overall, Bayes Rule represents the single best model for predicting human choices, subjects are heterogeneous and a significant share of them make suboptimal choices that reflect judgement biases described by Kahneman and Tversky, including the "representativeness heuristic" (excessive weight on the evidence from the sample relative to the prior) and "conservatism" (excessive weight on the prior relative to the sample). We compare the performance of AI subjects gathered from recent versions of large language models (LLMs), including several versions of ChatGPT. These general-purpose generative AI chatbots are not specifically trained to do well in narrow decision-making tasks, but are trained instead as "language predictors" using a large corpus of textual data from the web. We show that ChatGPT is also subject to biases that result in suboptimal decisions. However, we document a rapid evolution in the performance of ChatGPT, from sub-human performance for early versions (ChatGPT 3.5) to superhuman and nearly perfect Bayesian classifications in the latest versions (ChatGPT 4o). |
Date: | 2025–04 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2504.10636 |
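A worked sketch of the benchmark decision rule in this class of experiments: posterior odds equal prior odds times the likelihood ratio of the observed sample, and the optimal classification picks the cage with the higher posterior. The numbers are illustrative of an El-Gamal and Grether style task, not taken from the paper.

```python
# Worked sketch: Bayes Rule in a two-cage (urn) binary classification task.
from math import comb

def posterior_A(prior_A, p_A, p_B, successes, draws):
    """Posterior probability of cage A after observing `successes` in `draws`."""
    like_A = comb(draws, successes) * p_A**successes * (1 - p_A)**(draws - successes)
    like_B = comb(draws, successes) * p_B**successes * (1 - p_B)**(draws - successes)
    return prior_A * like_A / (prior_A * like_A + (1 - prior_A) * like_B)

# Cage A: 2/3 "high" balls; cage B: 1/2 "high" balls; prior on A is 1/3;
# a sample of 6 draws contains 4 "high" balls.
p = posterior_A(prior_A=1/3, p_A=2/3, p_B=1/2, successes=4, draws=6)
print(f"P(A | data) = {p:.3f}  -> choose A only if this exceeds 0.5")
# Representativeness overweights the sample; conservatism overweights the prior.
```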
By: | Zhao, Chuqing; Chen, Yisong |
Abstract: | Online platforms such as Reddit have become significant spaces for public discussions on mental health, offering valuable insights into psychological distress and support-seeking behaviors. Large Language Models (LLMs) have emerged as powerful tools for analyzing these discussions, enabling the identification of mental health trends, crisis signals, and potential interventions. This work develops an LLM-based topic modeling framework tailored for domain-specific mental health discourse, uncovering latent themes within user-generated content. Additionally, an interactive and interpretable visualization system is designed to allow users to explore data at various levels of granularity, enhancing the understanding of mental health narratives. This approach aims to bridge the gap between large-scale AI analysis and human-centered interpretability, contributing to more effective and responsible mental health insights on social media. |
Date: | 2025–05–02 |
URL: | https://d.repec.org/n?u=RePEc:osf:socarx:xbpts_v1 |
By: | Tianhe Zhang; Suhan Liu; Peng Shi |
Abstract: | Fairness has emerged as a critical consideration in the landscape of machine learning algorithms, particularly as AI continues to transform decision-making across societal domains. To ensure that these algorithms are free from bias and do not discriminate against individuals based on sensitive attributes such as gender and race, the field of algorithmic bias has introduced various fairness concepts, along with methodologies to achieve these notions in different contexts. Despite the rapid advancement, not all sectors have embraced these fairness principles to the same extent. One specific sector that merits attention in this regard is insurance. Within the realm of insurance pricing, fairness is defined through a distinct and specialized framework. Consequently, achieving fairness according to established notions does not automatically ensure fair pricing in insurance. In particular, regulators are increasingly emphasizing transparency in pricing algorithms and imposing constraints on insurance companies regarding the collection and utilization of sensitive consumer attributes. These factors present additional challenges for the implementation of fairness in pricing algorithms. To address these complexities and comply with regulatory demands, we propose an efficient method for constructing fair models tailored to the insurance domain, using only privatized sensitive attributes. Notably, our approach ensures statistical guarantees, does not require direct access to sensitive attributes, and adapts to varying transparency requirements, addressing regulatory demands while ensuring fairness in insurance pricing. |
Date: | 2025–04 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2504.11775 |
By: | Julien Pascal |
Abstract: | This paper presents a novel method to solve high-dimensional economic models using neural networks when the exact calculation of the gradient by backpropagation is impractical or inapplicable. This method relies on the gradient-free bias-corrected Monte Carlo (bc-MC) operator, which constitutes, under certain conditions, an asymptotically unbiased estimator of the gradient of the loss function. This method is well-suited for high-dimensional models, as it requires only two evaluations of a residual function to approximate the gradient of the loss function, regardless of the model dimension. I demonstrate that the gradient-free bias-corrected Monte Carlo operator has appealing properties as long as the economic model satisfies Lipschitz continuity. This makes the method particularly attractive in situations involving non-differentiable loss functions. I demonstrate the broad applicability of the gradient-free bc-MC operator by solving large-scale overlapping generations (OLG) models with aggregate uncertainty, including scenarios involving borrowing constraints that introduce non-differentiability in household optimization problems. |
Keywords: | Dynamic programming, neural networks, machine learning, Monte Carlo, overlapping generations, occasionally binding constraints. |
JEL: | C45 C61 C63 C68 E32 E37 |
Date: | 2025–04 |
URL: | https://d.repec.org/n?u=RePEc:bcl:bclwop:bclwp196 |
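A rough sketch of the general idea of a two-evaluation, gradient-free estimate of the loss gradient (in the spirit of simultaneous-perturbation methods). It is meant only to illustrate why two residual evaluations can suffice in any dimension; the paper's bias-corrected Monte Carlo (bc-MC) operator has its own construction, which is not reproduced here.

```python
# Rough sketch: two-evaluation, gradient-free loss-gradient estimate (SPSA-style).
# Illustrative only; NOT the paper's bias-corrected Monte Carlo (bc-MC) operator.
import numpy as np

def loss(theta, residual_fn, sample):
    """Mean squared residual of the model equations on a Monte Carlo sample."""
    return np.mean(residual_fn(theta, sample) ** 2)

def two_point_gradient(theta, residual_fn, sample, rng, c=1e-3):
    delta = rng.choice([-1.0, 1.0], size=theta.shape)         # random perturbation
    l_plus = loss(theta + c * delta, residual_fn, sample)     # evaluation 1
    l_minus = loss(theta - c * delta, residual_fn, sample)    # evaluation 2
    return (l_plus - l_minus) / (2 * c) * delta               # works in any dimension

# Toy problem: parameters should match a noisy target of 1.0 in every coordinate
residual = lambda theta, s: theta - s
rng = np.random.default_rng(7)
theta = np.zeros(50)
for step in range(2000):
    sample = rng.normal(1.0, 0.1, size=theta.shape)
    theta -= 0.5 * two_point_gradient(theta, residual, sample, rng)
print("mean parameter after training:", theta.mean())         # close to 1.0
```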
By: | Cong William Lin; Wu Zhu |
Abstract: | Large Language Models (LLMs), such as ChatGPT, are reshaping content creation and academic writing. This study investigates the impact of AI-assisted generative revisions on research manuscripts, focusing on heterogeneous adoption patterns and their influence on writing convergence. Leveraging a dataset of over 627,000 academic papers from arXiv, we develop a novel classification framework by fine-tuning prompt- and discipline-specific large language models to detect the style of ChatGPT-revised texts. Our findings reveal substantial disparities in LLM adoption across academic disciplines, gender, native language status, and career stage, alongside a rapid evolution in scholarly writing styles. Moreover, LLM usage enhances clarity, conciseness, and adherence to formal writing conventions, with improvements varying by revision type. Finally, a difference-in-differences analysis shows that while LLMs drive convergence in academic writing, early adopters, male researchers, non-native speakers, and junior scholars exhibit the most pronounced stylistic shifts, aligning their writing more closely with that of established researchers. |
Date: | 2025–04 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2504.13629 |
By: | Herbert Dawid; Philipp Harting; Hankui Wang; Zhongli Wang; Jiachen Yi |
Abstract: | This paper introduces a methodology for economic research based on agentic workflows that leverages Large Language Models (LLMs) and multimodal AI to enhance research efficiency and reproducibility. Our approach features autonomous and iterative processes covering the entire research lifecycle, from ideation and literature review to economic modeling and data processing, empirical analysis, and result interpretation, with strategic human oversight. The workflow architecture comprises specialized agents with clearly defined roles, structured inter-agent communication protocols, systematic error escalation pathways, and adaptive mechanisms that respond to changing research demands. Human-in-the-loop (HITL) checkpoints are strategically integrated to ensure methodological validity and ethical compliance. We demonstrate the practical implementation of our framework using Microsoft's open-source platform, AutoGen, presenting experimental examples that highlight both the current capabilities and future potential of agentic workflows in improving economic research. |
Date: | 2025–04 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2504.09736 |
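A minimal sketch of a two-agent workflow with a human-in-the-loop checkpoint, assuming the pyautogen 0.2-style API (AssistantAgent, UserProxyAgent, initiate_chat); the model name and credentials are placeholders, and the paper's specialized multi-agent architecture is far richer than this.

```python
# Minimal sketch of an agentic workflow with a human-in-the-loop checkpoint,
# assuming the pyautogen 0.2-style API; config values are placeholders.
import autogen

llm_config = {"config_list": [{"model": "gpt-4o", "api_key": "YOUR_API_KEY"}]}

researcher = autogen.AssistantAgent(
    name="research_assistant",
    system_message="You draft literature summaries and propose model specifications.",
    llm_config=llm_config,
)
human = autogen.UserProxyAgent(
    name="economist",
    human_input_mode="ALWAYS",   # HITL checkpoint: each step awaits human review
    code_execution_config=False,
)

human.initiate_chat(
    researcher,
    message="Summarize recent work on agent-based macro models and propose one extension.",
)
```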