nep-big 2026-05-11 papers

on Big Data

Issue of 2026–05–11
eleven papers chosen by
Tom Coupé, University of Canterbury

Forecasting Recessions Using Machine Learning on Text Data and Mixed-Frequency Predictors By Yusuke Oh; Mototsugu Shintani
Advancing Predictive Analytics in Child Malnutrition: Machine, Ensemble and Deep Learning Models with Balanced Class Distribution for Early Detection of Stunting and Wasting By Mgomezulu, Wisdom Richard; Thangata, Paul; Mkandawire, Bertha; Amoah, Nana
A Review of Large Language Models for Stock Price Forecasting from a Hedge-Fund Perspective By Olivia Zhang; Zhilin Zhang
What Did I Forget? Basket Analysis for Large Assortments Using Transformers By Luuk van Maasakkers; Bas Donkers; Dennis Fok
Gridded-labour market data in Ghana using remote sensing and random forest By Jin, Yan; Charpe, Matthieu; Mei, Yang; Li, Zeshuo
Beyond Sequential Prediction: Learning Financial Market Dynamics in Volatile and Non-Stationary Environments through Sentiment-Conditioned Generative Modelling By Alexis Lazanas; Spyridon Karpouzis
AI-Based Forecasting of Czech Inflation: Quantile Regression Forests with Dynamic Weights By Filip Blaha; Jan Botka; Josef Sveda; Ales Michl
Machine Learning Forecasts of Asymmetric Betas Using Firm-Specific Information By Thomas Conlon; John Cotter; Iason Kynigakis
Determinants of Liquidity in the Japanese Government Bond Market: An Interpretable Machine Learning Approach By Satoko Kojima; Toshiyuki Sakiyama
Do News and Social Media Tell the Same Story? Constructing and Comparing Sentiment Spillover Networks By Fan Wu; Anqi Liu; Jing Chen; Yuhua Li
Fusing Generative AI and Economic Modelling to Estimate Field-Level Crop Production in Data-Scarce World Regions By Baumert, Josef; Heckelei, Thomas; Estes, Lyndon; Storm, Hugo

Forecasting Recessions Using Machine Learning on Text Data and Mixed-Frequency Predictors

By:	Yusuke Oh (Deputy Director, Institute for Monetary and Economic Studies, Bank of Japan (E-mail: yuusuke.ou@boj.or.jp)); Mototsugu Shintani (The University of Tokyo (E-mail: shintani@e.u-tokyo.ac.jp))
Abstract:	We forecast Japanese recessions by integrating machine learning methods, mixed-frequency data, and text-based indicators within an unrestricted mixed data sampling (U-MIDAS) framework. The model combines monthly macroeconomic variables with weekly financial indicators and newspaper-based text indicators. A pseudo-real-time forecasting exercise over three decades shows that machine learning models consistently outperform traditional logit benchmarks. The model confidence set (MCS) suggests horizon dependence: text indicators are more informative at short horizons, while financial variables are more informative at longer horizons. To improve interpretability, we apply sparse principal component analysis (Sparse PCA) to the text indicators and identify three economic narratives: 'Corporate Distress,' 'Financial Distress,' and 'Deflationary Pressure.' Furthermore, SHAP (SHapley Additive exPlanations) analysis indicates that different recession episodes are associated with different combinations of these narratives, underscoring the heterogeneous nature of economic downturns.
Keywords:	business cycles, mixed data sampling, model confidence set, text analysis, recession forecasting
JEL:	C32 C53 E37 O53
Date:	2026–03
URL:	https://d.repec.org/n?u=RePEc:ime:imedps:26-e-07
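
The U-MIDAS idea named in this abstract can be illustrated with a minimal sketch: each high-frequency lag becomes its own column, so a downstream classifier can assign it an unrestricted coefficient. The sketch assumes exactly four weekly observations per month, which is a simplification, and the function name `umidas_design` is hypothetical. It is not the authors' implementation.

```python
import numpy as np

def umidas_design(weekly, weeks_per_month=4, n_lags=4):
    """Build one row per month from the last `n_lags` weekly values.

    Assumption: every month has exactly `weeks_per_month` weeks.
    Each column is a separate weekly lag (most recent first), so a
    model fitted on this matrix estimates one free weight per lag.
    """
    weekly = np.asarray(weekly, dtype=float)
    n_months = len(weekly) // weeks_per_month
    rows = []
    for m in range(n_months):
        end = (m + 1) * weeks_per_month  # index just past the month's last week
        if end < n_lags:
            continue  # not enough weekly history for this month
        rows.append(weekly[end - n_lags:end][::-1])  # most recent week first
    return np.vstack(rows)

# Toy weekly indicator covering three months (values 0..11).
weekly = np.arange(12.0)
X = umidas_design(weekly)
```

The resulting matrix `X` can be placed next to the monthly predictors and passed to a logit or machine learning classifier of recession dates.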

Advancing Predictive Analytics in Child Malnutrition: Machine, Ensemble and Deep Learning Models with Balanced Class Distribution for Early Detection of Stunting and Wasting

By:	Mgomezulu, Wisdom Richard; Thangata, Paul; Mkandawire, Bertha; Amoah, Nana
Abstract:	Child malnutrition remains a critical public health challenge in sub-Saharan Africa, with traditional surveillance methods proving inadequate for early detection and intervention. This study leverages advanced machine learning and deep learning techniques to revolutionize stunting and wasting prediction in Malawi, utilizing nationally representative World Bank’s Living Standards Measurement Surveys (LSMS) data to develop robust predictive models capable of identifying at-risk children before clinical manifestations emerge. Seven classification algorithms were evaluated, including ensemble methods (Random Forest, XGBoost), Deep Neural Networks (DNN), and traditional approaches (SVM, Logistic Regression, KNN, Gradient Boosting). Class imbalance challenges were addressed through SMOTE implementation and strategic class weighting. Model performance was assessed using accuracy, precision, recall, F1-score, and AUC-ROC metrics across balanced datasets. Results demonstrate exceptional predictive capabilities, with Random Forest achieving perfect performance for wasting prediction (100% accuracy, precision, recall, F1-score, and AUC-ROC) and near-perfect stunting classification (99.98% accuracy). XGBoost demonstrated comparable excellence with 99.49% accuracy for wasting and 95.52% for stunting prediction. DNN showed strong performance (91.50% wasting accuracy, 76.64% stunting accuracy), while traditional methods exhibited moderate effectiveness, with logistic regression achieving the lowest performance (66.58% wasting, 64.72% stunting accuracy). These findings represent a paradigm shift toward proactive nutritional surveillance, enabling early identification of vulnerable populations through data-driven approaches. The superior performance of ensemble algorithms provides policymakers with powerful tools for evidence-based resource allocation and targeted interventions. Implementation of these predictive models within Malawi's health systems could significantly enhance early detection capabilities, facilitate timely nutritional interventions, and contribute substantially to achieving global nutrition targets while reducing childhood mortality rates.
Keywords:	Food Security and Poverty
Date:	2026–03
URL:	https://d.repec.org/n?u=RePEc:ags:aes026:397868

A Review of Large Language Models for Stock Price Forecasting from a Hedge-Fund Perspective

By:	Olivia Zhang; Zhilin Zhang
Abstract:	Large language models (LLMs) are increasingly deployed in quantitative finance for stock price forecasting. This review synthesizes recent applications of LLMs in this domain, including extracting sentiment from financial news and social media, analyzing financial reports and earnings-call transcripts, tokenizing or symbolizing stock price series, and constructing multi-agent trading systems. Particular attention is paid to practical pitfalls that are often understated in the literature, such as fragility in sentiment analysis, dataset and horizon design, performance evaluation metrics, data leakage, illiquidity premia, and limits of stock price predictability. Organized from a hedge-fund perspective, the review is intended to guide both academic researchers and hedge fund managers in integrating LLMs into real-world trading pipelines and in stress-testing their robustness under realistic market frictions.
Date:	2026–04
URL:	https://d.repec.org/n?u=RePEc:arx:papers:2605.05211

What Did I Forget? Basket Analysis for Large Assortments Using Transformers

By:	Luuk van Maasakkers (Erasmus University Rotterdam); Bas Donkers (Erasmus University Rotterdam); Dennis Fok (Erasmus University Rotterdam)
Abstract:	We propose a new method for learning product complementarity patterns in shopping baskets, inspired by Google's Bidirectional Encoder Representations from Transformers (BERT) for natural language processing. We reformulate BERT's masked learning task in a marketing context and learn to accurately identify missing products from a real-life grocery shopping basket based on the other products purchased in that same basket. The resulting model, which we call BaskERT, can be used by retailers for personalized product recommendations and for analyzing product complementarity patterns across the assortment. BaskERT outperforms several state-of-the-art benchmarks in a basket completion task. Different procedures for sampling the missing product during training impact the variety of recommendations returned by the model. This enables marketers to steer their recommendations away from the most popular products. The model is easily scalable to large assortments. As our model only requires basket data from the current shopping trip, it is applicable in many situations, also when customer information and purchase history data are not available, for example because of privacy regulations.
Keywords:	product basket prediction, machine learning, transformers, product embedding
JEL:	M31
Date:	2025–12–11
URL:	https://d.repec.org/n?u=RePEc:tin:wpaper:20250071
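
The masked-product task described in this abstract can be sketched as follows: one product is hidden from a basket and the remaining products form the context from which it must be recovered. The helper `mask_one`, the toy basket, and the inverse-popularity weighting are all illustrative assumptions. The abstract says only that the sampling procedure affects recommendation variety and does not specify which procedures were used.

```python
import random

def mask_one(basket, popularity=None, rng=random):
    """Hide one product from a basket and return (context, target).

    Assumption: products within a basket are unique.
    With `popularity=None` the hidden product is sampled uniformly.
    Otherwise it is sampled with weight 1 / popularity, a hypothetical
    scheme that makes rarer products more likely to be the target.
    """
    if popularity is None:
        weights = [1.0] * len(basket)
    else:
        weights = [1.0 / popularity[p] for p in basket]
    target = rng.choices(basket, weights=weights, k=1)[0]
    context = [p for p in basket if p != target]
    return context, target

rng = random.Random(0)
basket = ["pasta", "tomato sauce", "parmesan", "basil"]
context, target = mask_one(basket, rng=rng)

# Hypothetical purchase counts: pasta is very popular, basil is rare.
counts = {"pasta": 1000, "tomato sauce": 400, "parmesan": 100, "basil": 10}
context_rare, target_rare = mask_one(basket, popularity=counts, rng=rng)
```

A model trained on such pairs predicts `target` from `context`. Changing the weights changes which products the model most often learns to recover.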

Gridded-labour market data in Ghana using remote sensing and random forest

By:	Jin, Yan; Charpe, Matthieu; Mei, Yang; Li, Zeshuo
Abstract:	This study presents the first high-resolution (0.005°) gridded labor market data, generated by downscaling district-level census data for Ghana using random forest algorithms and remote sensing. It addresses the lack of spatially disaggregated labor market data by mapping 17 employment categories—including age, gender, skills, status, sectors, unemployment, and NEET. Auxiliary data (64 variables) such as land cover, nighttime lights, infrastructure, and points of interest are integrated to capture demographic, economic, and participation factors. The model achieves high accuracy (R2 > 90% for most categories) and reveals significant spatial heterogeneity, with employment rates ranging from 10% to 98% across pixels. Results highlight urban-rural and North-South divides, as well as sectoral concentrations. Variable importance analysis underscores the role of built-up areas, nighttime light, road density, and vegetation health in predicting employment patterns, with specificity across different employment categories. The methodology advances beyond traditional GDP or population gridding by incorporating labor market complexity. Findings demonstrate the potential of machine learning and geospatial data to enhance socio-economic mapping in data-scarce contexts.
Keywords:	labour market analysis, mapping, human geography, information technology
Date:	2026
URL:	https://d.repec.org/n?u=RePEc:ilo:ilowps:995694369302676

Beyond Sequential Prediction: Learning Financial Market Dynamics in Volatile and Non-Stationary Environments through Sentiment-Conditioned Generative Modelling

By:	Alexis Lazanas; Spyridon Karpouzis
Abstract:	The problem of time-series forecasting in non-stationary and complex environments is a challenging task in machine learning, especially with heterogeneous numerical and textual data present. Traditional statistical models like AutoRegressive Integrated Moving Average (ARIMA) are based on the assumptions of linearity and stationarity, whereas recurrent neural networks like Long Short-Term Memory (LSTM) models do not necessarily represent distributional properties in highly volatile settings. This paper proposes a hybrid model that combines Generative Adversarial Networks (GANs) with Natural Language Processing (NLP)-based sentiment analysis to enable sentiment-conditioned time-series prediction. The model integrates adversarial learning on numerical sequences with contextual sentiment representations derived from unstructured text, enabling them to be jointly modelled to capture temporal dynamics and exogenous information. These results demonstrate the promise of hybrid generative and language-aware methods to enhance prediction robustness in non-stationary environments.
Date:	2026–04
URL:	https://d.repec.org/n?u=RePEc:arx:papers:2604.22801

AI-Based Forecasting of Czech Inflation: Quantile Regression Forests with Dynamic Weights

By:	Filip Blaha; Jan Botka; Josef Sveda; Ales Michl
Abstract:	We construct a quantile regression forest for inflation forecasting in the Czech Republic, inspired by growing literature on the use of Machine Learning in macroeconomics and finance. We contribute to the literature by implementing an optimisation scheme with time-varying weights that incorporates information from the entire distribution to form the point forecast. By dynamically reflecting the distribution of future inflation paths, our framework outperforms both standard mean and median point forecasts and delivers gains relative to conventional linear benchmark models. We also forecast individual inflation subcomponents that enable us to disentangle the drivers of future inflation and its risks. Furthermore, we integrate the Shapley-value decomposition to enhance the interpretability of our results and adjust the model's predictors for a small open economy.
Keywords:	Czech Republic, forecasting, inflation, machine learning, quantile regression forest, small open economy, time varying weights
JEL:	C53 C55 E31 E37 E52
Date:	2026–04
URL:	https://d.repec.org/n?u=RePEc:cnb:wpaper:2026/09

Machine Learning Forecasts of Asymmetric Betas Using Firm-Specific Information

By:	Thomas Conlon; John Cotter; Iason Kynigakis
Abstract:	We demonstrate that machine learning methods provide a powerful framework for modelling conditional asymmetric risk. Using a large cross-section of US stocks and a comprehensive set of firm characteristics, we show that allowing for nonlinearities significantly increases the out-of-sample performance across a wide range of asymmetric beta measures and forecasting horizons. Trading frictions, followed by characteristics related to intangibles, momentum and growth, emerge as the most important drivers of future risk dynamics. Reconstructing CAPM beta from forecasts of asymmetric beta components indicates that a more granular decomposition of systematic risk yields a more accurate representation of market beta. We also find that incorporating conditional beta forecasts into discounted cash flow models that account for the term structure of betas enhances equity valuation accuracy. Finally, we show that the statistical outperformance of conditional betas translates into economically significant benefits for market-neutral portfolio investors.
Date:	2026–04
URL:	https://d.repec.org/n?u=RePEc:arx:papers:2604.22933

Determinants of Liquidity in the Japanese Government Bond Market: An Interpretable Machine Learning Approach

By:	Satoko Kojima (Director, Institute for Monetary and Economic Studies, Bank of Japan (Email: satoko.kojima@boj.or.jp)); Toshiyuki Sakiyama (Director and Senior Economist, Institute for Monetary and Economic Studies, Bank of Japan (Email: toshiyuki.sakiyama@boj.or.jp))
Abstract:	Liquidity in government bond markets is critical for the functioning of financial markets. This paper studies the determinants of market liquidity, measured by price dispersion, by constructing various bond features using high-granularity data from the Bank of Japan Financial Network System and applying machine learning approaches. The main findings are threefold. First, the decomposition of the liquidity indicator into bond features reveals that the historical volatility of benchmark prices of Japanese government bonds has been the main driver of the liquidity indicator, while the contributions of the share of non-clearing participants' transactions and the share of the central bank's transactions and holdings have increased since around 2022. Second, some bond features affect the liquidity indicator non-linearly. For bond features such as the share of foreign financial institutions' transactions, the number of trading financial institutions, and the share of the central bank's holdings, the liquidity indicator improves as the values of these bond features increase, but deteriorates once they exceed certain thresholds. Third, bond features such as maturity, the historical volatility of benchmark prices, and the number of trading counterparties per institution affect the liquidity indicator by strongly interacting with other bond features.
Keywords:	Market liquidity, Government bond markets, Bond features, Machine learning approach
JEL:	C59 G12
Date:	2026–03
URL:	https://d.repec.org/n?u=RePEc:ime:imedps:26-e-03

Do News and Social Media Tell the Same Story? Constructing and Comparing Sentiment Spillover Networks

By:	Fan Wu; Anqi Liu; Jing Chen; Yuhua Li
Abstract:	Investor sentiment reflects the collective attitude of investors towards the asset, whether positive, negative or neutral. Market information, such as news and relevant social media posts, plays a significant role in shaping investor sentiment, which influences investment decisions accordingly. The sentiment for one single company may spill over to other relevant companies which are in the same industry. The information spillover network pattern between news and social media may also differ, as they are two different media sources. In this study, we introduce a network-based transfer entropy method to measure and compare the information transmission of news and social media sentiment across the technology companies. We examine whether and to what extent sentiment information from one company can transfer to other companies, and how different the spillover effect is for news and social media. The result signifies a stronger intensity of news information flow among the tech companies after COVID-19. We also highlight the companies which act as information hubs in the sentiment network. Furthermore, we identify the companies which lead the strongest information flow chain. Overall, this study provides a novel perspective in modelling sentiment spillover under two different media sources, and we find that news and social media show a different information transmission pattern during the studied period.
Date:	2026–04
URL:	https://d.repec.org/n?u=RePEc:arx:papers:2604.26811
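
Transfer entropy, the measure this abstract builds its networks on, quantifies how much the past of one series improves prediction of another beyond that series' own past. The following is a minimal plug-in sketch for discrete series with a history length of one. The discretisation into symbols, the history length, and the function name `transfer_entropy` are assumptions made for illustration, not the estimator used in the paper.

```python
from collections import Counter
from math import log2
import random

def transfer_entropy(x, y):
    """Plug-in estimate (in bits) of transfer entropy from x to y.

    Uses history length 1:
    TE = sum p(y1, y0, x0) * log2( p(y1 | y0, x0) / p(y1 | y0) ),
    where y1 is the next value of y and y0, x0 are current values.
    """
    triples = list(zip(y[1:], y[:-1], x[:-1]))  # (y_next, y_now, x_now)
    n = len(triples)
    c_full = Counter(triples)
    c_cond = Counter((y0, x0) for _, y0, x0 in triples)  # (y_now, x_now)
    c_pair = Counter((y1, y0) for y1, y0, _ in triples)  # (y_next, y_now)
    c_own = Counter(y0 for _, y0, _ in triples)          # y_now
    te = 0.0
    for (y1, y0, x0), c in c_full.items():
        p_joint = c / n
        p_given_both = c / c_cond[(y0, x0)]       # p(y_next | y_now, x_now)
        p_given_own = c_pair[(y1, y0)] / c_own[y0]  # p(y_next | y_now)
        te += p_joint * log2(p_given_both / p_given_own)
    return te

# Toy example: y copies x with a one-step delay, so information flows x -> y.
random.seed(0)
x = [random.randint(0, 1) for _ in range(5000)]
y = [0] + x[:-1]
te_xy = transfer_entropy(x, y)
te_yx = transfer_entropy(y, x)
```

In this toy setup `te_xy` is close to one bit and `te_yx` is close to zero. Computing the measure for every ordered pair of companies would give a directed, weighted matrix that can be read as a spillover network.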

Fusing Generative AI and Economic Modelling to Estimate Field-Level Crop Production in Data-Scarce World Regions

By:	Baumert, Josef; Heckelei, Thomas; Estes, Lyndon; Storm, Hugo
Abstract:	Spatially explicit information on farmers’ crop choice and how it is impacted by changing environmental or economic conditions is essential to foster food security and predict future crop production. While such knowledge could be particularly beneficial in world regions with large rural populations and high climate risk, data scarcity often impedes crop production mapping and modelling. We present a novel approach that links environmental data, satellite imagery, and regional statistics using economic modelling and generative AI for high-resolution crop choice mapping and modelling without labelled observations. The approach builds on two components: first, we employ a reduced-form model based on economic theory to express the cultivation probability of a crop at a specific location as a function of environmental and potentially economic conditions. Second, we use k-Deep Variational Autoencoders, a class of generative neural networks, to cluster pixels with similar appearance on satellite imagery into groups that can be associated to crop types. By linking both components and jointly estimating all model parameters, the economic model provides prior knowledge to the clustering approach while benefiting from the information entailed in the remote sensing imagery. Validation for France indicates high overall accuracies of the obtained crop maps. We additionally apply the approach to northern Ghana and simulate how an increase in in-season droughts would impact the spatial distribution of major food crops, information crucial for food security policies. Our method is applicable to numerous world regions.
Keywords:	Research and Development/Tech Change/Emerging Technologies
Date:	2026–03
URL:	https://d.repec.org/n?u=RePEc:ags:aes026:397894

This nep-big issue is ©2026 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.

General information on the NEP project can be found at https://nep.repec.org. For comments please write to the director of NEP, Marco Novarese at <director@nep.repec.org>. Put “NEP” in the subject, otherwise your mail may be rejected.

NEP’s infrastructure is sponsored by the Griffith Business School of Griffith University in Australia.