on Big Data |
By: | Alexandre d'Aspremont (LIENS - Laboratoire d'informatique de l'école normale supérieure - DI-ENS - Département d'informatique - ENS-PSL - École normale supérieure - Paris - PSL - Université Paris Sciences et Lettres - Inria - Institut National de Recherche en Informatique et en Automatique - CNRS - Centre National de la Recherche Scientifique, SIERRA - Statistical Machine Learning and Parsimony - DI-ENS - Centre Inria de Paris, Kayrros); Simon Ben Arous (Kayrros); Jean-Charles Bricongne (LEO - Laboratoire d'Économie d'Orléans [2022-...] - UO - Université d'Orléans - UT - Université de Tours - UCA - Université Clermont Auvergne, Centre de recherche de la Banque de France - Banque de France); Benjamin Lietti (EPEE - Centre d'Etudes des Politiques Economiques - UEVE - Université d'Évry-Val-d'Essonne - Université Paris-Saclay); Baptiste Meunier (Centre de recherche de la Banque Centrale Européenne - Banque Centrale Européenne, AMSE - Aix-Marseille Sciences Economiques - EHESS - École des hautes études en sciences sociales - AMU - Aix Marseille Université - ECM - École Centrale de Marseille - CNRS - Centre National de la Recherche Scientifique) |
Abstract: | This paper exploits daily infrared images taken from satellites to track economic activity in advanced and emerging countries. We first develop a framework to read, clean, and exploit satellite images. Our algorithm uses the laws of physics (Planck's law) and machine learning to detect the heat produced by cement plants in activity. This allows us to monitor in real time whether a cement plant is working. Applying this to around 1,000 plants, we construct a satellite-based index. We show that this satellite index outperforms benchmark models and alternative indicators for nowcasting the production of the cement industry as well as activity in the construction sector. Comparing across methods, neural networks appear to yield more accurate predictions as they allow us to exploit the granularity of our dataset. Overall, combining satellite images and machine learning can help policymakers take informed and swift economic policy decisions by nowcasting economic activity accurately and in real time. |
Keywords: | Big data, Data science, Machine learning, Construction, High-frequency data |
Date: | 2024 |
URL: | https://d.repec.org/n?u=RePEc:hal:journl:hal-05104995 |
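The physical step in the pipeline above can be made concrete. The following minimal Python sketch inverts Planck's law to turn thermal-band radiance into brightness temperature and flags a plant as active when its pixels run hotter than the local background; the band wavelength, radiance units, and 15 K threshold are illustrative assumptions, not the authors' specification.

    import numpy as np

    H = 6.626e-34   # Planck constant (J s)
    C = 2.998e8     # speed of light (m/s)
    K = 1.381e-23   # Boltzmann constant (J/K)

    def brightness_temperature(radiance, wavelength):
        """Invert Planck's law: spectral radiance (W m^-2 sr^-1 m^-1) -> temperature (K)."""
        return (H * C / (wavelength * K)) / np.log(1.0 + 2.0 * H * C**2 / (wavelength**5 * radiance))

    def plant_is_active(plant_radiances, background_radiances, wavelength=11e-6, delta_k=15.0):
        """Illustrative activity test: plant pixels hotter than local background by delta_k kelvin."""
        t_plant = brightness_temperature(np.asarray(plant_radiances), wavelength)
        t_bg = brightness_temperature(np.asarray(background_radiances), wavelength)
        return float(t_plant.max() - np.median(t_bg)) > delta_k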
By: | Yuke Zhang |
Abstract: | This study introduces an interpretable machine learning (ML) framework to extract macroeconomic alpha from global news sentiment. We process the Global Database of Events, Language, and Tone (GDELT) Project's worldwide news feed using FinBERT -- a Bidirectional Encoder Representations from Transformers (BERT) based model pretrained on finance-specific language -- to construct daily sentiment indices incorporating mean tone, dispersion, and event impact. These indices drive an XGBoost classifier, benchmarked against logistic regression, to predict next-day returns for EUR/USD, USD/JPY, and 10-year U.S. Treasury futures (ZN). Rigorous out-of-sample (OOS) backtesting (5-fold expanding-window cross-validation, OOS period: c. 2017-April 2025) demonstrates exceptional, cost-adjusted performance for the XGBoost strategy: Sharpe ratios achieve 5.87 (EUR/USD), 4.65 (USD/JPY), and 4.65 (Treasuries), with respective compound annual growth rates (CAGRs) exceeding 50% in Foreign Exchange (FX) and 22% in bonds. Shapley Additive Explanations (SHAP) affirm that sentiment dispersion and article impact are key predictive features. Our findings establish that integrating domain-specific Natural Language Processing (NLP) with interpretable ML offers a potent and explainable source of macro alpha. |
Date: | 2025–05 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2505.16136 |
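The index construction described above can be sketched in a few lines. The snippet below scores headlines with a FinBERT checkpoint and aggregates to daily mean tone, dispersion, and an impact proxy; the model id, column names, and the use of mention counts for impact are assumptions for illustration, not the paper's exact recipe.

    import pandas as pd
    from transformers import pipeline

    finbert = pipeline("text-classification", model="ProsusAI/finbert")  # assumed checkpoint

    def daily_sentiment_index(news: pd.DataFrame) -> pd.DataFrame:
        """news: columns ['date', 'headline', 'num_mentions'] (num_mentions proxies impact)."""
        signs = {"positive": 1.0, "negative": -1.0, "neutral": 0.0}
        tone = [signs[o["label"]] * o["score"]
                for o in finbert(news["headline"].tolist(), truncation=True)]
        return news.assign(tone=tone).groupby("date").agg(
            mean_tone=("tone", "mean"),
            dispersion=("tone", "std"),
            impact=("num_mentions", "sum"),
        )

Daily features like these would then feed the XGBoost classifier on next-day return direction.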
By: | Altug Aydemir; Cem Cebi |
Abstract: | This study aims to forecast the future behavior of budget variables for Türkiye using Artificial Neural Network (ANN) and Deep Neural Network (DNN) techniques. In particular, we focus on budget expenditures, tax revenues, and their main components. Annual data were used and divided into two sub-periods: a training set (2002-2019) and a test set (2020-2022). Each fiscal item is estimated using relevant explanatory variables selected based on economic theory. We achieved good forecasting performance for the main budget items using ANN and DNN methodologies. First, we found that most of the Mean Absolute Error (MAE) values fell within the acceptable range, an indicator of good prediction performance. Second, the MAE values for public expenditures are lower than those for taxes. Third, estimating total tax revenues (aggregate data) performs better than estimating the subcomponents of taxes (disaggregated data); the opposite is the case for public expenditures. |
Keywords: | Machine Learning, Deep Learning, Artificial Neural Network (ANN), Deep Neural Network (DNN), Budget Forecast, Government Spending, Tax Revenue |
JEL: | C53 H20 H50 H68 |
Date: | 2025 |
URL: | https://d.repec.org/n?u=RePEc:tcb:wpaper:2509 |
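As a stylized version of the exercise above, the sketch below fits a shallow and a deeper feed-forward network on the 2002-2019 training years and scores 2020-2022 with MAE; the file name, regressor columns, and layer sizes are placeholders, not the authors' data or architecture.

    import pandas as pd
    from sklearn.metrics import mean_absolute_error
    from sklearn.neural_network import MLPRegressor

    df = pd.read_csv("budget_items.csv", index_col="year")         # hypothetical annual dataset
    X, y = df[["gdp", "inflation", "imports"]], df["tax_revenue"]  # illustrative theory-based regressors
    X_tr, y_tr = X.loc[2002:2019], y.loc[2002:2019]
    X_te, y_te = X.loc[2020:2022], y.loc[2020:2022]

    for name, model in [("ANN", MLPRegressor(hidden_layer_sizes=(8,), max_iter=5000, random_state=0)),
                        ("DNN", MLPRegressor(hidden_layer_sizes=(32, 16, 8), max_iter=5000, random_state=0))]:
        model.fit(X_tr, y_tr)
        print(name, "test MAE:", mean_absolute_error(y_te, model.predict(X_te)))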
By: | Altug Aydemir; Mert Gokcu |
Abstract: | In recent years, machine learning-based techniques have gained prominence in forecasting crude oil prices due to their ability to handle the highly volatile and nonlinear nature of oil prices effectively. The primary objective of this paper is to forecast monthly oil prices with the highest possible precision and accuracy. To this end, we propose a deepened and highly parametrized version of a deep neural network framework that integrates widely adopted algorithms and a variety of datasets. Our approach also identifies an optimal architecture for the deep neural networks used in oil price forecasting and offers forecasts that are repeatable and consistent. All evaluation metrics indicate that the proposed model achieves superior forecasting performance compared to simple conventional statistical models. |
Date: | 2025 |
URL: | https://d.repec.org/n?u=RePEc:tcb:econot:2511 |
By: | Giorgio Alfredo Spedicato (Leitha SRL); Christophe Dutang (ASAR - Applied Statistics And Reliability - LJK - Laboratoire Jean Kuntzmann - Inria - Institut National de Recherche en Informatique et en Automatique - CNRS - Centre National de la Recherche Scientifique - UGA - Université Grenoble Alpes - Grenoble INP - Institut polytechnique de Grenoble - Grenoble Institute of Technology); Quentin Guibert (CEREMADE - CEntre de REcherches en MAthématiques de la DEcision - Université Paris Dauphine-PSL - PSL - Université Paris Sciences et Lettres - CNRS - Centre National de la Recherche Scientifique, LSAF - Laboratoire de Sciences Actuarielle et Financière - UCBL - Université Claude Bernard Lyon 1 - Université de Lyon) |
Abstract: | Credibility theory is the usual framework in actuarial science for reinforcing individual experience by transferring rates estimated from collective information. Based on the paradigm of transfer learning, this article presents the idea that a machine learning (ML) model pre-trained on a rich market data portfolio can improve the prediction of rates for an individual insurance portfolio. This framework consists first in training several ML models on a market portfolio of insurance data. Pre-trained models provide valuable information on the relations between features and predicted rates. Furthermore, features shared with the company dataset are used to predict rates better than the same ML models trained on the insurer's dataset alone. Our approach is illustrated with classical ML models on an anonymized dataset including both market data and data from a European non-life insurance company, and is compared with a hierarchical Bühlmann-Straub credibility model. We observe that the transfer learning strategy combining company data with external market data significantly improves prediction accuracy compared to an ML model trained only on the insurer's data, and provides competitive results compared to hierarchical credibility models. |
Keywords: | Transfer learning, Hierarchical credibility theory, Bühlmann credibility theory, Boosting, Deep Learning |
Date: | 2025–06–27 |
URL: | https://d.repec.org/n?u=RePEc:hal:journl:hal-04821310 |
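One concrete way to realize the pre-train-then-adapt idea above is gradient-boosting warm starts. The sketch below pre-trains an XGBoost model on a large synthetic "market" portfolio and continues boosting on a small "company" portfolio via the xgb_model argument; the Poisson objective and synthetic data are assumptions standing in for the paper's models and confidential datasets.

    import numpy as np
    import xgboost as xgb

    rng = np.random.default_rng(0)
    beta = np.array([0.3, -0.2, 0.1])
    X_market = rng.normal(size=(20000, 3)); y_market = rng.poisson(np.exp(X_market @ beta))    # rich market data
    X_company = rng.normal(size=(1000, 3)); y_company = rng.poisson(np.exp(X_company @ beta))  # small insurer data

    params = {"objective": "count:poisson", "max_depth": 4, "eta": 0.05}
    pretrained = xgb.train(params, xgb.DMatrix(X_market, label=y_market), num_boost_round=300)
    # The transfer step: keep boosting from the market model on the insurer's own data.
    transferred = xgb.train(params, xgb.DMatrix(X_company, label=y_company),
                            num_boost_round=100, xgb_model=pretrained)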
By: | Duane, Jackson; Morgan, Ashley; Carter, Emily |
Abstract: | Financial institutions are increasingly leveraging unstructured data---such as text, audio, and images---to gain insights and competitive advantage. Deep learning (DL) has emerged as a powerful paradigm for analyzing these complex data types, transforming tasks like financial news analysis, earnings call interpretation, and document parsing. This paper provides a comprehensive academic review of deep learning techniques for unstructured financial data. We present a taxonomy of data types and DL methods, including natural language processing models, speech and audio processing frameworks, multimodal fusion approaches, and transformer-based architectures. We survey key applications ranging from sentiment analysis and market prediction to fraud detection, credit risk assessment, and beyond, highlighting recent advancements in each domain. Additionally, we discuss major challenges unique to financial settings, such as data scarcity and annotation cost, model interpretability and regulatory compliance, and the dynamic, non-stationary nature of financial data. We enumerate prominent datasets and benchmarks that have accelerated research, and identify research gaps and future directions. The review emphasizes the latest developments up to 2025, including the rise of large pre-trained models and multimodal learning, and outlines how these innovations are shaping the next generation of financial analytics. |
Date: | 2025–06–25 |
URL: | https://d.repec.org/n?u=RePEc:osf:osfxxx:gdvbj_v1 |
By: | Qingyu Li; Chiranjib Mukhopadhyay; Abolfazl Bayat; Ali Habibnia |
Abstract: | Recent advances in quantum computing have demonstrated its potential to significantly enhance the analysis and forecasting of complex classical data. Among these, quantum reservoir computing has emerged as a particularly powerful approach, combining quantum computation with machine learning for modeling nonlinear temporal dependencies in high-dimensional time series. As with many data-driven disciplines, quantitative finance and econometrics can benefit greatly from emerging quantum technologies. In this work, we investigate the application of quantum reservoir computing to realized volatility forecasting. Our model employs a fully connected transverse-field Ising Hamiltonian as the reservoir, with distinct input and memory qubits to capture temporal dependencies. The quantum reservoir computing approach is benchmarked against several econometric models and standard machine learning algorithms. The models are evaluated using multiple error metrics and the model confidence set procedure. To enhance interpretability and mitigate current quantum hardware limitations, we use wrapper-based forward selection to identify optimal feature subsets and quantify feature importance via Shapley values. Our results indicate that the proposed quantum reservoir approach consistently outperforms benchmark models across various metrics, highlighting its potential for financial forecasting despite existing quantum hardware constraints. This work serves as a proof of concept for the applicability of quantum computing in econometrics and financial analysis, paving the way for further research into quantum-enhanced predictive modeling as quantum hardware capabilities continue to advance. |
Date: | 2025–05 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2505.13933 |
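For reference, the reservoir named above is the fully connected transverse-field Ising model, which in standard notation reads

    H = -\sum_{i<j} J_{ij}\, \sigma_i^z \sigma_j^z - h \sum_i \sigma_i^x

where \sigma_i^z and \sigma_i^x are Pauli operators on qubit i, J_{ij} are the all-to-all couplings, and h is the transverse field; in the setup described above, the time series drives the input qubits while memory qubits carry information across steps. The paper's specific couplings and encoding are not reproduced here.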
By: | Kubra Bolukbas; Ertan Tok |
Abstract: | The goal of this study is to identify the most effective model for predicting credit risk, that is, the likelihood that a commercial loan defaults (becomes a non-performing loan), in the Turkish banking sector, and to determine which firm and loan characteristics influence that risk. The analysis draws on an unbalanced dataset of 1.2 million firm-level observations for 2018–2023, combining financial ratios with detailed loan- and firm-specific information. Class imbalance is addressed through oversampling (including SMOTE) and multiple down-sampling schemes. Although the risk is assessed ex-ante, model performance is evaluated ex-post using the ROC-AUC metric. Among the conventional econometric and machine learning approaches tested with different sampling techniques, Extreme Gradient Boosting (XGBoost) with oversampling delivers the best result, with a ROC-AUC score of 0.914. Compared with logistic regression under the same sampling setup, a 4.9-percentage-point increase in test ROC-AUC is attained, confirming the model’s superior predictive performance over conventional approaches. Accordingly, the study finds the industry and location in which a firm operates, its loan-restructuring status, loan cost and type (fixed vs. floating rate), the firm’s record of bad checks, and core ratios capturing profitability, liquidity, and leverage to be the most influential predictors of credit risk. |
Keywords: | Credit Risk, Machine Learning Techniques, Financial Ratios, Banking Sector, Macro-Financial Stability, Feature Importance |
JEL: | C52 C53 C55 G17 G2 G32 G33 |
Date: | 2025 |
URL: | https://d.repec.org/n?u=RePEc:tcb:wpaper:2508 |
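The winning configuration above can be sketched compactly: oversample the minority (default) class with SMOTE on the training fold only, fit XGBoost, and score ROC-AUC out of sample. Synthetic data stands in for the confidential loan-level dataset, and the hyperparameters are illustrative.

    from imblearn.over_sampling import SMOTE
    from sklearn.datasets import make_classification
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split
    from xgboost import XGBClassifier

    X, y = make_classification(n_samples=50000, n_features=20, weights=[0.95], random_state=0)  # ~5% defaults
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    X_res, y_res = SMOTE(random_state=0).fit_resample(X_tr, y_tr)   # balance the training fold only
    clf = XGBClassifier(n_estimators=400, max_depth=5, learning_rate=0.05, eval_metric="auc")
    clf.fit(X_res, y_res)
    print("test ROC-AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))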
By: | Millend Roy; Vladimir Pyltsov; Yinbo Hu |
Abstract: | Accurate electricity load forecasting is essential for grid stability, resource optimization, and renewable energy integration. While transformer-based deep learning models like TimeGPT have gained traction in time-series forecasting, their effectiveness in long-term electricity load prediction remains uncertain. This study evaluates forecasting models ranging from classical regression techniques to advanced deep learning architectures using data from the ESD 2025 competition. The dataset includes two years of historical electricity load data, alongside temperature and global horizontal irradiance (GHI) across five sites, with a one-day-ahead forecasting horizon. Since actual test-set load values remain undisclosed, leveraging predicted values would accumulate errors, making this a long-term forecasting challenge. We (i) employ Principal Component Analysis (PCA) for dimensionality reduction, (ii) frame the task as a regression problem, using temperature and GHI as covariates to predict load for each hour, and (iii) ultimately stack 24 models to generate yearly forecasts. Our results reveal that deep learning models, including TimeGPT, fail to consistently outperform simpler statistical and machine learning approaches due to the limited availability of training data and exogenous variables. In contrast, XGBoost, with minimal feature engineering, delivers the lowest error rates across all test cases while maintaining computational efficiency. This highlights the limitations of deep learning in long-term electricity forecasting and reinforces the importance of model selection based on dataset characteristics rather than complexity. Our study provides insights into practical forecasting applications and contributes to the ongoing discussion on the trade-offs between traditional and modern forecasting methods. |
Date: | 2025–05 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2505.11390 |
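The per-hour design above amounts to training 24 independent regressors keyed by hour of day. A minimal sketch, assuming a frame with timestamp, temperature, GHI, and load columns (the file and column names are placeholders):

    import pandas as pd
    from xgboost import XGBRegressor

    df = pd.read_csv("site_load.csv", parse_dates=["timestamp"])   # hypothetical two-year history
    df["hour"] = df["timestamp"].dt.hour

    models = {hour: XGBRegressor(n_estimators=300, max_depth=4).fit(grp[["temperature", "ghi"]], grp["load"])
              for hour, grp in df.groupby("hour")}                 # one model per hour of day

    def forecast(future: pd.DataFrame) -> pd.Series:
        """future: ['timestamp', 'temperature', 'ghi'] rows; routes each row to its hourly model."""
        hours = future["timestamp"].dt.hour
        return pd.Series([models[h].predict([[row["temperature"], row["ghi"]]])[0]
                          for h, (_, row) in zip(hours, future.iterrows())], index=future.index)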
By: | Paker, Meredith; Stephenson, Judy; Wallis, Patrick |
Abstract: | Understanding long-run economic growth requires reliable historical data, yet the vast majority of long-run economic time series are drawn from incomplete records with significant temporal and geographic gaps. Conventional solutions to these gaps rely on linear regressions that risk bias or overfitting when data are scarce. We introduce “past predictive modeling, ” a framework that leverages machine learning and out-of-sample predictive modeling techniques to reconstruct representative historical time series from scarce data. Validating our approach using nominal wage data from England, 1300-1900, we show that this new method leads to more accurate and generalizable estimates, with bootstrapped standard errors 72% lower than benchmark linear regressions. Beyond improving accuracy, these improved wage estimates for England yield new insights into the impact of the Black Death on inequality, the economic geography of pre-industrial growth, and productivity over the long run. |
Keywords: | machine learning; predictive modeling; wages; black death; industrial revolution |
JEL: | J31 C53 N33 N13 N63 |
Date: | 2025–06–13 |
URL: | https://d.repec.org/n?u=RePEc:ehl:lserod:128852 |
By: | Mahdi Kohan Sefidi |
Abstract: | Financial crises often occur without warning, yet markets leading up to these events display increasing volatility and complex interdependencies across multiple sectors. This study proposes a novel approach to predicting market crises by combining multilayer network analysis with Long Short-Term Memory (LSTM) models, using Granger causality to capture within-layer connections and Random Forest to model interlayer relationships. Specifically, we utilize Granger causality to model the temporal dependencies between market variables within individual layers, such as asset prices, trading values, and returns. To represent the interactions between different market variables across sectors, we apply Random Forest to model the interlayer connections, capturing the spillover effects between these features. The LSTM model is then trained to predict market instability and potential crises based on the dynamic features of the multilayer network. Our results demonstrate that this integrated approach, combining Granger causality, Random Forest, and LSTM, significantly enhances the accuracy of market crisis prediction, outperforming traditional forecasting models. This methodology provides a powerful tool for financial institutions and policymakers to better monitor systemic risks and take proactive measures to mitigate financial crises. |
Date: | 2025–05 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2505.11019 |
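The within-layer step above, wiring directed edges between series that Granger-cause one another, can be sketched as follows; the lag order, test choice, and 5% threshold are assumptions, and random data stands in for the market variables.

    import numpy as np
    import pandas as pd
    from statsmodels.tsa.stattools import grangercausalitytests

    rng = np.random.default_rng(0)
    layer = pd.DataFrame(rng.normal(size=(500, 3)), columns=["price", "value", "return"])

    def granger_adjacency(df: pd.DataFrame, maxlag: int = 5, alpha: float = 0.05) -> pd.DataFrame:
        """Directed adjacency matrix: 1 where the row series Granger-causes the column series."""
        adj = pd.DataFrame(0, index=df.columns, columns=df.columns)
        for cause in df.columns:
            for effect in df.columns:
                if cause == effect:
                    continue
                res = grangercausalitytests(df[[effect, cause]], maxlag=maxlag, verbose=False)
                pvals = [res[lag][0]["ssr_ftest"][1] for lag in res]
                adj.loc[cause, effect] = int(min(pvals) < alpha)
        return adj

    print(granger_adjacency(layer))

The resulting layer adjacencies, together with the Random-Forest-modeled interlayer links, would form the dynamic network features fed to the LSTM.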
By: | Junzhe Jiang; Chang Yang; Aixin Cui; Sihan Jin; Ruiyu Wang; Bo Li; Xiao Huang; Dongning Sun; Xinrun Wang |
Abstract: | Financial tasks are pivotal to global economic stability; however, their execution faces challenges including labor-intensive processes, low error tolerance, data fragmentation, and tool limitations. Although large language models (LLMs) have succeeded in various natural language processing tasks and have shown potential in automating workflows through reasoning and contextual understanding, current benchmarks for evaluating LLMs in finance lack sufficient domain-specific data, have simplistic task designs, and use incomplete evaluation frameworks. To address these gaps, this article presents FinMaster, a comprehensive financial benchmark designed to systematically assess the capabilities of LLMs in financial literacy, accounting, auditing, and consulting. Specifically, FinMaster comprises three main modules: i) FinSim, which builds simulators that generate synthetic, privacy-compliant financial data for companies to replicate market dynamics; ii) FinSuite, which provides tasks in core financial domains, spanning 183 tasks of various types and difficulty levels; and iii) FinEval, which develops a unified interface for evaluation. Extensive experiments over state-of-the-art LLMs reveal critical capability gaps in financial reasoning, with accuracy dropping from over 90% on basic tasks to merely 40% on complex scenarios requiring multi-step reasoning. This degradation exhibits the propagation of computational errors, where single-metric calculations that initially demonstrate 58% accuracy fall to 37% in multi-metric scenarios. To the best of our knowledge, FinMaster is the first benchmark that covers full-pipeline financial workflows with challenging tasks. We hope that FinMaster can bridge the gap between research and industry practitioners, driving the adoption of LLMs in real-world financial practices to enhance efficiency and accuracy. |
Date: | 2025–05 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2505.13533 |
By: | Yingjie Kuang; Tianchen Zhang; Zhen-Wei Huang; Zhongjie Zeng; Zhe-Yuan Li; Ling Huang; Yuefang Gao |
Abstract: | Accurately predicting customers' purchase intentions is critical to the success of a business strategy. Current research mainly focuses on analyzing the specific types of products that customers are likely to purchase in the future, while little attention has been paid to the critical question of whether customers will engage in repurchase behavior. Predicting whether a customer will make the next purchase is a classic time series forecasting task. However, in real-world purchasing behavior, customer groups typically exhibit imbalance: there are a large number of occasional buyers and a small number of loyal customers. This head-to-tail distribution means traditional time series forecasting methods face certain limitations when dealing with such problems. To address these challenges, this paper proposes a unified Clustering and Attention mechanism GRU model (CAGRU) that leverages multi-modal data for customer purchase intention prediction. The framework first performs customer profiling with respect to customer characteristics and clusters the customers to delineate the different customer clusters that contain similar features. Then, the time series features of different customer clusters are extracted by a GRU neural network, and an attention mechanism is introduced to capture the significance of sequence locations. Furthermore, to mitigate the head-to-tail distribution of customer segments, we train the model separately for each customer segment, to adapt to and capture more accurately the differences in behavioral characteristics between customer segments, as well as the similar characteristics of customers within the same segment. We constructed four datasets and conducted extensive experiments to demonstrate the superiority of the proposed CAGRU approach. |
Date: | 2025–05 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2505.13558 |
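A minimal PyTorch sketch of the per-segment predictor described above: a GRU encodes the purchase-history sequence and a soft attention layer weights sequence positions before a binary repurchase head. Dimensions are illustrative; following the paper's design, one such model would be trained per customer cluster.

    import torch
    import torch.nn as nn

    class AttentionGRU(nn.Module):
        def __init__(self, n_features: int, hidden: int = 32):
            super().__init__()
            self.gru = nn.GRU(n_features, hidden, batch_first=True)
            self.attn = nn.Linear(hidden, 1)     # scores each sequence position
            self.head = nn.Linear(hidden, 1)     # repurchase logit

        def forward(self, x):                    # x: (batch, seq_len, n_features)
            h, _ = self.gru(x)                   # (batch, seq_len, hidden)
            w = torch.softmax(self.attn(h), dim=1)   # attention weights over positions
            context = (w * h).sum(dim=1)         # weighted sum of hidden states
            return self.head(context).squeeze(-1)    # logits; apply sigmoid for probabilities

    model = AttentionGRU(n_features=6)
    logits = model(torch.randn(8, 30, 6))        # 8 customers, 30-step histories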
By: | Dumas, Christelle (University of Fribourg, Switzerland); Gautrain, Elsa (University of Fribourg, Switzerland); Gosselin-Pali, Adrien (Université Clermont Auvergne) |
Abstract: | In sub-Saharan Africa, child fostering—a widespread practice in which a child moves out of the household of her biological parents—can have significant implications for a child’s overall well-being. Using longitudinal data from South Africa that includes individual tracking, we employ double machine learning techniques to evaluate the impact of fostering on nutrition, addressing biases related to selection into treatment and endogenous attrition, two common challenges in the literature. Our findings reveal that fostering reduces the probability of being stunted by 6.8 percentage points, corresponding to a 37 percent reduction compared to the mean prevalence. This improvement appears to be driven by foster children relocating to smaller, rural households, often including retired individuals, typically grandparents, who receive a pension. Furthermore, we find that it not only enhances the nutritional status of foster children but also benefits the nutrition of other children from sending households, suggesting that fostering can be mutually beneficial for both groups. |
Keywords: | Child Fostering; Nutrition; Machine Learning; South Africa |
JEL: | I15 J12 J13 O15 C14 |
Date: | 2025–07–01 |
URL: | https://d.repec.org/n?u=RePEc:fri:fribow:fribow00542 |
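The estimator behind the result above can be illustrated with the standard partialling-out form of double machine learning: predict the outcome and the treatment from controls with flexible learners, then regress residual on residual. The data below are synthetic, with the treatment effect set to echo the 6.8-percentage-point estimate reported above; nothing here reproduces the South African panel.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
    from sklearn.model_selection import cross_val_predict

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5000, 10))                    # household controls
    d = rng.binomial(1, 0.2, size=5000)                # fostering indicator
    y = -0.068 * d + 0.1 * X[:, 0] + rng.normal(scale=0.5, size=5000)  # synthetic stunting outcome

    m_hat = cross_val_predict(RandomForestRegressor(random_state=0), X, y, cv=5)   # E[y|X]
    e_hat = cross_val_predict(RandomForestClassifier(random_state=0), X, d, cv=5,
                              method="predict_proba")[:, 1]                        # E[d|X]
    theta = np.sum((y - m_hat) * (d - e_hat)) / np.sum((d - e_hat) ** 2)           # residual-on-residual
    print("estimated effect of fostering:", theta)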
By: | Sander de Vries (Vrije Universiteit Amsterdam and Tinbergen Institute) |
Abstract: | This paper provides new insights on the importance of family background by linking 1.7 million Dutch children’s incomes to an exceptionally rich set of family characteristics — including income, wealth, education, occupation, crime, and health. Using a machine learning approach, I show that conventional analyses using parental income only considerably underestimate intergenerational dependence. This underestimation is concentrated at the extremes of the child income distribution, where families are often (dis)advantaged across multiple dimensions. Gender differences in intergenerational dependence are minimal, despite allowing for complex gender-specific patterns. A comparison with adoptees highlights the role of pre-birth factors in driving intergenerational transmission. |
Keywords: | Intergenerational mobility, inequality of opportunity |
JEL: | I24 J24 J62 |
Date: | 2025–02–14 |
URL: | https://d.repec.org/n?u=RePEc:tin:wpaper:20250010 |
By: | Capistrano, Daniel (University College Dublin); Creighton, Mathew (University College Dublin); Fernández-Reino, Mariña |
Abstract: | In this study, we assessed whether Large Language Models provided biased answers when prompted to assist with the evaluation of requests made by individuals of different ethnic backgrounds and genders. We emulated an experimental procedure traditionally used in correspondence studies to test discrimination in social settings. The recommendations given by the language models were compared across groups, revealing a significant bias against names associated with ethnic minorities, particularly in the housing domain. However, the magnitude of this ethnic bias, as well as differences by gender, depended on the context mentioned in the prompt to the model. Finally, directing the model to take into consideration regulatory provisions on Artificial Intelligence or on potential gender and ethnic discrimination does not seem to mitigate the observed bias between groups. |
Date: | 2025–07–06 |
URL: | https://d.repec.org/n?u=RePEc:osf:socarx:9zusq_v1 |
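The audit design above follows the correspondence-study template: hold the request fixed, vary only the applicant name, and compare the models' recommendations across name groups. A stylized sketch, in which query_model is a hypothetical stand-in for the audited LLM and the name pools are illustrative:

    from collections import defaultdict

    PROMPT = "Two applicants ask to rent the flat. Recommend one: {a} or {b}."
    majority, minority = ["Sean Murphy"], ["Mohammed Ali"]          # illustrative name pools

    def query_model(prompt: str) -> str:
        """Hypothetical stand-in; replace with a call to the audited LLM."""
        return prompt.split(": ")[1].split(" or ")[0]               # dummy: always picks the first name

    counts = defaultdict(int)
    for name_maj in majority:
        for name_min in minority:
            for a, b in [(name_maj, name_min), (name_min, name_maj)]:  # rotate order to cancel position bias
                answer = query_model(PROMPT.format(a=a, b=b))
                counts["majority" if name_maj in answer else "minority"] += 1
    print(dict(counts))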
By: | Gianluca De Nard; Damjan Kostovic |
Abstract: | The paper introduces a new type of shrinkage estimation that is not based on asymptotic optimality but uses artificial intelligence (AI) techniques to shrink the sample eigenvalues. The proposed AI Shrinkage estimator applies to both linear and nonlinear shrinkage, demonstrating improved performance compared to the classic shrinkage estimators. Our results demonstrate that reinforcement learning solutions identify a downward bias in classic shrinkage intensity estimates derived under the i.i.d. assumption and automatically correct for it in response to prevailing market conditions. Additionally, our data-driven approach enables more efficient implementation of risk-optimized portfolios and is well-suited for real-world investment applications including various optimization constraints. |
Keywords: | Covariance matrix estimation, linear and nonlinear shrinkage, portfolio management, reinforcement learning, risk optimization |
JEL: | C13 C58 G11 |
Date: | 2025–05 |
URL: | https://d.repec.org/n?u=RePEc:zur:econwp:470 |
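For context on the baseline being improved upon, classic linear shrinkage of the sample covariance is a one-liner in scikit-learn; the proposed AI estimator instead learns the eigenvalue shrinkage with reinforcement learning, which this sketch does not reproduce.

    import numpy as np
    from sklearn.covariance import LedoitWolf

    rng = np.random.default_rng(0)
    returns = rng.normal(size=(250, 100))      # T=250 days, N=100 assets: high concentration N/T
    lw = LedoitWolf().fit(returns)
    print("linear shrinkage intensity:", lw.shrinkage_)   # weight placed on the structured target
    sigma_hat = lw.covariance_                 # shrunk covariance, usable in risk optimization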
By: | Lo, Chi-Sheng |
Abstract: | This study explores whether a NASDAQ-100 derivatives ETF portfolio can outperform the Invesco QQQ Trust (QQQ) using a Deep Reinforcement Learning framework based on Proximal Policy Optimization (PPO). The portfolio dynamically allocates across three NASDAQ-100 derivative ETFs: YQQQ (short options income), QYLD (covered calls), and TQQQ (3x leveraged), employing Isolation Forest anomaly detection to optimize rebalancing timing. A train-validation-test framework (2010-2018 training, 2019-2023 validation, 2024-2025 testing) utilizes a multi-objective function to balance tracking-error minimization and excess-return maximization, integrating dividend payments and combining quarterly with event-driven rebalancing. The results show significant alpha generation over QQQ by leveraging YQQQ’s inverse exposure, QYLD’s income stability, and TQQQ’s leveraged growth. Though the strategy experiences higher volatility and drawdowns, the PPO agent skillfully optimizes allocations, achieving positive excess returns in the testing phase, with performance varying by market condition, emphasizing the need for adaptive strategies in dynamic markets. |
Keywords: | Deep reinforcement learning, enhanced index tracking, isolation forest, QQQ, Nasdaq 100, exchange traded fund, options derivatives |
JEL: | C32 C44 C61 |
Date: | 2025–07–10 |
URL: | https://d.repec.org/n?u=RePEc:pra:mprapa:125307 |
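The timing component above can be sketched with scikit-learn's Isolation Forest: fit on a history of market features and trigger an event-driven rebalance when the current day is flagged anomalous. The features, training window, and contamination rate are assumptions; the PPO allocation policy itself is not reproduced.

    import numpy as np
    from sklearn.ensemble import IsolationForest

    rng = np.random.default_rng(0)
    features = rng.normal(size=(1000, 3))      # e.g. daily return, volatility, and volume z-scores

    detector = IsolationForest(contamination=0.05, random_state=0).fit(features[:750])
    for t in range(750, 1000):
        if detector.predict(features[t : t + 1])[0] == -1:   # -1 flags an anomaly
            pass  # event-driven rebalance: let the PPO agent reset YQQQ/QYLD/TQQQ weights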