By: | Katsafados, Apostolos G.; Leledakis, George N.; Panagiotou, Nikolaos P.; Pyrgiotakis, Emmanouil G. |
Abstract: | We combine machine learning algorithms (ML) with textual analysis techniques to forecast bank stock returns. Our textual features are derived from press releases of the Federal Open Market Committee (FOMC). We show that ML models produce more accurate out-of-sample predictions than OLS regressions, and that textual features can be more informative inputs than traditional financial variables. However, we achieve the highest predictive accuracy by training ML models on a combination of both financial variables and textual data. Importantly, portfolios constructed using the predictions of our best performing ML model consistently outperform their benchmarks. Our findings add to the scarce literature on bank return predictability and have important implications for investors. |
Keywords: | Bank stock prediction; Trading strategies; Machine learning; Press conferences; Natural language processing; Banks |
JEL: | C53 C88 G00 G11 G12 G14 G17 G21 |
Date: | 2024–10 |
URL: | https://d.repec.org/n?u=RePEc:pra:mprapa:122899 |
By: | Abbate Nicolás Francisco; Gasparini Leonardo; Ronchetti Franco; Quiroga Facundo |
Abstract: | In this study, we examine the potential of using high-resolution satellite imagery and machine learning techniques to create income maps with a high level of geographic detail. We trained a convolutional neural network with satellite images from the Metropolitan Area of Buenos Aires (Argentina) and 2010 census data to estimate per capita income at a 50x50 meter resolution for 2013, 2018, and 2022, exceeding the spatial resolution and temporal frequency of the available census information. Based on the EfficientNetV2 architecture, the model achieved high accuracy in predicting household incomes ($R^2=0.878$), surpassing the spatial resolution and model performance of other methods used in the existing literature. This approach presents new opportunities for the generation of highly disaggregated data, enabling the assessment of public policies at a local scale, providing tools for better targeting of social programs, and reducing the information gap in areas where data are not collected. |
JEL: | C81 C45 |
Date: | 2024–11 |
URL: | https://d.repec.org/n?u=RePEc:aep:anales:4701 |
By: | Jairo Flores (Banco Central de Reserva del Perú); Bruno Gonzaga (Banco Central de Reserva del Perú); Walter Ruelas-Huanca (Banco Central de Reserva del Perú); Juan Tang (Banco Central de Reserva del Perú) |
Abstract: | This paper explores the application of machine learning (ML) techniques to nowcast the monthly year-over-year growth rate of both total and non-primary GDP in Peru. Using a comprehensive dataset that includes over 170 domestic and international predictors, we assess the predictive performance of 12 ML models, including Lasso, Ridge, Elastic Net, Support Vector Regression, Random Forest, XGBoost, and Neural Networks. The study compares these ML approaches against the traditional Dynamic Factor Model (DFM), which serves as the benchmark for nowcasting in economic research. We treat specific configurations, such as the feature matrix rotations and the dimensionality reduction technique, as hyperparameters that are optimized iteratively by the Tree-structured Parzen Estimator. Our results show that ML models outperform the DFM in nowcasting total GDP and achieve similar performance to this benchmark in nowcasting non-primary GDP. Furthermore, the bottom-up approach appears to be the most effective practice for nowcasting economic activity, as aggregating sectoral predictions improves the precision of ML methods. The findings indicate that ML models offer a viable and competitive alternative to traditional nowcasting methods. |
Keywords: | GDP, Machine Learning, nowcasting |
JEL: | C14 C32 E32 E52 |
Date: | 2024–12 |
URL: | https://d.repec.org/n?u=RePEc:rbp:wpaper:2024-019 |
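The Tree-structured Parzen Estimator mentioned by Flores et al. is implemented, for example, in Optuna's TPESampler. The sketch below shows that tuning step on synthetic data, with a Ridge model standing in for the paper's full set of learners; all names, data, and settings are illustrative assumptions, not the authors' code.

```python
import numpy as np
import optuna
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 170))                 # 120 months, 170 predictors (synthetic)
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.5, size=120)

def objective(trial):
    # Treat the regularization strength and a crude dimensionality cut as hyperparameters.
    alpha = trial.suggest_float("alpha", 1e-3, 1e3, log=True)
    n_keep = trial.suggest_int("n_keep", 10, 170)
    cv = TimeSeriesSplit(n_splits=5)            # respect the chronological ordering
    scores = cross_val_score(Ridge(alpha=alpha), X[:, :n_keep], y,
                             cv=cv, scoring="neg_mean_squared_error")
    return -scores.mean()

study = optuna.create_study(direction="minimize",
                            sampler=optuna.samplers.TPESampler(seed=0))
study.optimize(objective, n_trials=50)
print(study.best_params)
```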
By: | Dyakonova, Ludmila; Konstantinov, Alexey |
Abstract: | The article studies approaches to improving the forecasting quality of machine learning models in finance. It reviews the literature on the application of machine learning and artificial intelligence in the banking sector, both from the perspective of risk management and, in more detail, with respect to the applied methods of credit scoring and fraud detection. Aspects of applying explainable artificial intelligence (XAI) methods in financial organizations are also considered. To identify the most effective machine learning models, the authors conducted experiments comparing eight classification models used in the financial sector. The gradient boosting model CatBoostClassifier was chosen as the baseline. Its results were compared with those of the other models: IsolationForest, a feature-ranking model using Recursive Feature Elimination (RFE), the XAI Shapley-values method, a model with an increased positive-class weight, and a wrapper model. All models were applied to five open financial datasets: one containing credit card transaction data, three containing retail lending data, and one containing consumer lending data. Our calculations revealed a slight improvement of the IsolationForest and wrapper models over the baseline CatBoostClassifier in terms of ROC AUC on the loan default data. |
Keywords: | financial risks, credit scoring, fraud detection, machine learning, explainable artificial intelligence methods, Catboost, SHAP |
JEL: | C63 |
Date: | 2024–12–10 |
URL: | https://d.repec.org/n?u=RePEc:pra:mprapa:122941 |
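As a rough companion to the Dyakonova and Konstantinov comparison, the sketch below contrasts a baseline CatBoostClassifier with a class-weighted variant on a synthetic imbalanced dataset, scored by ROC AUC. The data, weights, and settings are assumptions for illustration, not the paper's experiments.

```python
from catboost import CatBoostClassifier
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, heavily imbalanced "default" problem (roughly 5% positives).
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.95, 0.05],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

base = CatBoostClassifier(iterations=300, verbose=0, random_seed=0)
weighted = CatBoostClassifier(iterations=300, verbose=0, random_seed=0,
                              class_weights=[1.0, 19.0])   # up-weight the rare positives

for name, model in [("base", base), ("weighted", weighted)]:
    model.fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(name, round(auc, 4))
```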
By: | Nuttapol Lertmethaphat; Nuarpear Lekfuangfu; Pucktada Treeratpituk |
Abstract: | In recent decades, the Beveridge curve, which describes the relationship between unemployment and vacancies, has emerged as a central organizing framework for understanding labour markets, both for academics and for central banks. The absence of consistent data in Thailand is a fundamental obstacle to the use of this important indicator. Data from online job platforms offer an alternative, but the first and necessary step is to develop a process that can structure and standardise such data. In this paper, we develop an algorithm that standardises high-frequency data from job websites, consisting of manually written job titles from major online job posting websites in Thailand (in Thai and English), into International Standard Classification of Occupations (ISCO-2008) codes at up to the 4-digit level. Using Natural Language Processing and machine learning techniques, our methodology automates the process to deal efficiently with the volume and velocity of the data. Our approach not only carves a new path for comprehending labour market trends, but also enhances the capacity for monitoring labour market behaviour with greater precision and timeliness. Most of all, it offers a pivotal shift towards leveraging real-time, rich online job postings. |
Keywords: | Labour market; Beveridge Curve; Online job platform; Machine Learning; Natural Language Processing; Text Classification; Thailand |
JEL: | J2 J3 E24 N35 |
Date: | 2025–01 |
URL: | https://d.repec.org/n?u=RePEc:pui:dpaper:228 |
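The core step of the Lertmethaphat et al. pipeline, mapping free-text job titles to ISCO codes, can be illustrated with a standard text-classification baseline. The titles, codes, and model choice below are placeholders, not the authors' data or method; character n-grams are simply a common choice for short, noisily spelled titles.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative training set: raw titles paired with 4-digit ISCO-08 codes.
titles = ["software developer", "junior accountant", "hotel receptionist",
          "data engineer", "accounts payable clerk", "front desk agent"]
isco4  = ["2512", "3313", "4224", "2512", "3313", "4224"]

clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),  # tolerant of misspellings
    LogisticRegression(max_iter=1000),
)
clf.fit(titles, isco4)
print(clf.predict(["sofware develper", "receptionist (hotel)"]))
```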
By: | Li, Chao; Keeley, Alexander Ryota; Takeda, Shutaro; Seki, Daikichi; Managi, Shunsuke |
Abstract: | We create a large language model with high accuracy to investigate the relatedness between 12 environmental, social, and governance (ESG) topics and more than 2 million news reports. The text match pre-trained transformer (TMPT), with 138,843,049 parameters, is built to probe whether and how strongly a news record is connected to a specific topic of interest. The TMPT, based on the transformer structure and a pre-trained model, is an artificial intelligence model trained on more than 200,000 academic papers. The cross-validation result reveals that the TMPT's accuracy is 85.73%, which is excellent for zero-shot learning tasks. In addition, combined with sentiment analysis, our research monitors news attitudes and tones towards specific ESG topics daily from September 2021 to September 2023. The results indicate that the media are increasingly discussing social topics, while coverage of environmental issues is declining. Moreover, towards almost all topics, attitudes are gradually becoming more positive. Our research highlights the temporal shifts in public perception regarding 12 key ESG issues: ESG has been incrementally accepted by the public. These insights are invaluable for policymakers, corporate leaders, and communities as they navigate sustainable decision-making. |
Keywords: | ESG; News; Natural Language Processing; Pre-trained Transformer; Data Mining; Machine Learning |
JEL: | G0 H0 M1 |
Date: | 2024–11 |
URL: | https://d.repec.org/n?u=RePEc:pra:mprapa:122757 |
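The TMPT itself is not public. As an illustrative stand-in for the "how related is this news item to topic X" step, a generic zero-shot classifier from the transformers library can score topic relatedness; the model name, text, and topic labels below are assumptions for the sketch, not the authors' model.

```python
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

text = "The company pledged to cut supply-chain emissions by 40% over the next decade."
topics = ["climate change", "labor practices", "board governance", "community impact"]

# multi_label=True scores each topic independently, mimicking a relatedness probe.
result = classifier(text, candidate_labels=topics, multi_label=True)
for label, score in zip(result["labels"], result["scores"]):
    print(f"{label}: {score:.3f}")
```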
By: | Caravaggio, Nicola; Resce, Giuliano; Spanò, Idola Francesca |
Abstract: | This paper investigates determinants of local tax policy, with a particular focus on personal income tax rates in Italian municipalities. By employing seven Machine Learning (ML) algorithms, we assess and predict tax rate decisions, identifying Random Forest as the most accurate model. Results underscore the critical influence of demographic dynamics, fiscal health, socioeconomic conditions, and institutional quality on tax policy formulation. The findings not only showcase the power of ML in enhancing predictive precision in public finance but also provide actionable insights for policymakers and stakeholders, enabling more informed decision-making and the mitigation of fiscal uncertainties. |
Keywords: | Local taxation, Machine learning, Municipalities. |
JEL: | C53 H24 H71 |
Date: | 2024–09–24 |
URL: | https://d.repec.org/n?u=RePEc:mol:ecsdps:esdp24098 |
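In the spirit of the Caravaggio, Resce and Spanò exercise, the sketch below fits a random forest to synthetic municipal surtax data and ranks predictors by permutation importance. All variable names and data are invented for illustration; the paper's predictors and model comparison are far richer.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1500
X = pd.DataFrame({
    "population": rng.lognormal(9, 1, n),
    "debt_per_capita": rng.normal(500, 150, n),
    "elderly_share": rng.uniform(0.15, 0.35, n),
    "institutional_quality": rng.normal(0, 1, n),
})
# Hypothetical surtax rate driven by fiscal health and demographics.
y = 0.4 + 0.3 * X["debt_per_capita"] / 1000 + 0.5 * X["elderly_share"] \
    + rng.normal(0, 0.05, n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_tr, y_tr)
imp = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
print(pd.Series(imp.importances_mean, index=X.columns).sort_values(ascending=False))
```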
By: | Coco, Giuseppe; Monturano, Gianluca; Resce, Giuliano |
Abstract: | Public investment in infrastructure is essential for economic growth, but delays in project implementation can undermine its benefits. This paper examines the determinants of such delays using data from cohesion projects in Italy. We predict which projects are likely to experience delays and identify the key contributing factors by means of machine learning (ML) techniques. To avoid endogeneity, we use only (lagged) features observed at the start of the project as predictors. Our findings show that socioeconomic factors and institutional weaknesses in various regions play a significant role in these delays. The discipline imposed by the rules and strict implementation timing attached to EU funds seems to work, lending credibility to the hypothesis that an external commitment is beneficial. Results underscore the potential of ML in designing appropriate implementation policies, enhancing project management, and improving the outcomes of public investments. |
Keywords: | Territorial cohesion, Administrative efficiency, Machine learning, Project Delays. |
JEL: | H77 R58 O18 C55 |
Date: | 2025–01–08 |
URL: | https://d.repec.org/n?u=RePEc:mol:ecsdps:esdp25099 |
By: | Aguilar Rafael |
Abstract: | Argentina is in a high-inflation regime. For policymakers and other economic agents, reliable inflation forecasts are crucial. A handicap of classical predictive models is that as the number of variables grows, so does the variance of the predictions. The development of machine learning techniques makes it possible to model with tens or hundreds of variables and exploit the breadth of available data. Building on the work of D'Amato et al. (2018) and Silva Araujo and Piazza Gaglianone (2023), we develop a set of models to forecast Argentine inflation between 2016 and 2024 and compare their predictive power over different time horizons. The exercise includes (i) a univariate ARIMA time-series model, (ii) vector autoregression (VAR) models, and (iii) LASSO and Elastic Net regressions with machine learning elements. As a benchmark we use the median of the BCRA's REM survey. We find that the machine learning models perform better at both horizons. Among the variables selected by the LASSO and Elastic Net regressions, lagged inflation, the official exchange rate, nominal wages, the BCRA's ITCRM, and monetary aggregates such as private sector deposits and loans stand out. |
JEL: | C22 E31 |
Date: | 2024–11 |
URL: | https://d.repec.org/n?u=RePEc:aep:anales:4704 |
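A minimal sketch of the LASSO and Elastic Net step described by Aguilar, using synthetic data and scikit-learn's cross-validated estimators with a time-series split; the variables and settings are assumptions, not the BCRA series used in the paper.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV, LassoCV
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(1)
T, k = 100, 30
X = rng.normal(size=(T, k))   # stand-ins for lagged inflation, exchange rate, wages, money
y = 0.6 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(scale=0.2, size=T)

cv = TimeSeriesSplit(n_splits=5)                       # chronology-respecting CV
lasso = LassoCV(cv=cv).fit(X[:-1], y[1:])              # predict next-period inflation
enet = ElasticNetCV(cv=cv, l1_ratio=[0.2, 0.5, 0.8]).fit(X[:-1], y[1:])

print("LASSO keeps", int(np.sum(lasso.coef_ != 0)), "predictors")
print("Elastic Net keeps", int(np.sum(enet.coef_ != 0)), "predictors")
```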
By: | Chaohua Dong; Jiti Gao; Bin Peng; Yayi Yan |
Abstract: | In this paper, we consider estimation and inference for the unknown parameters and function involved in a class of generalized hierarchical models. Such models are of great interest in the neural network literature (e.g., Bauer and Kohler, 2019). We propose a rectified linear unit (ReLU) based deep neural network (DNN) approach and contribute to the design of DNNs by i) providing more transparency for practical implementation, ii) defining different types of sparsity, iii) showing the differentiability, iv) pointing out the set of effective parameters, and v) offering a new variant of the ReLU activation function. Asymptotic properties are established accordingly, and a feasible procedure for inference is also proposed. We conduct extensive numerical studies to examine the finite-sample performance of the estimation methods, and we evaluate the empirical relevance and applicability of the proposed models and estimation methods on real data. |
Keywords: | Estimation Theory; Deep Neural Network; Hierarchical Model; ReLU |
JEL: | C14 C45 G12 |
Date: | 2024 |
URL: | https://d.repec.org/n?u=RePEc:msh:ebswps:2024-7 |
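For readers unfamiliar with the model class, the sketch below defines a plain ReLU feed-forward network in PyTorch. It is a generic illustration only and does not implement the sparsity structure, effective-parameter set, or ReLU variant developed by Dong, Gao, Peng and Yan.

```python
import torch
import torch.nn as nn

class ReLUNet(nn.Module):
    """A plain fully connected network with ReLU activations."""
    def __init__(self, d_in: int, width: int = 32, depth: int = 3):
        super().__init__()
        layers, d = [], d_in
        for _ in range(depth):
            layers += [nn.Linear(d, width), nn.ReLU()]
            d = width
        layers += [nn.Linear(d, 1)]          # scalar regression output
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

x = torch.randn(16, 5)                       # 16 observations, 5 covariates
model = ReLUNet(d_in=5)
print(model(x).shape)                        # torch.Size([16, 1])
```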
By: | Koukopoulos, Anastasios; Farmakis, Timoleon; Katiaj, Pavlina; Fraidaki, Katerina; Kavatha, Marina |
Abstract: | In the digital era, emerging technologies such as Vision Pro are crucial for businesses due to their transformative potential across various industries. As an amalgamation of augmented reality (AR), virtual reality (VR), computer vision, and machine learning, Vision Pro technology represents a frontier in human-computer interaction, offering innovative solutions and opening up new avenues for value creation in business. Given the early stage of this technology, this study aims to explore the spectrum of reactions to Vision Pro by presenting a sentiment analysis of the 'VisionPro' subreddit, a community dedicated to discussing vision technologies. Through sentiment analysis, we discern patterns that suggest the factors driving positive and negative reactions within the community. This paper sheds light on the specific sentiments prevalent in the 'VisionPro' subreddit and demonstrates the applicability of sentiment analysis for understanding community dynamics in technology-focused online forums. The findings contribute to the broader discourse on public sentiment towards emerging technologies, offering implications for developers, researchers, and enthusiasts engaged in vision technology. |
Keywords: | Vision Pro; Sentiment Analysis; RedditExtractoR; Augmented Reality; Virtual Reality |
JEL: | M3 |
Date: | 2024 |
URL: | https://d.repec.org/n?u=RePEc:pra:mprapa:123180 |
By: | Angela C. Lyons (University of Illinois Urbana-Champaign); Josephine Kass-Hanna (IESEG School of Management, Univ. Lille); Deepika Pingali (University of Illinois Urbana-Champaign); Aiman Soliman (University of Illinois Urbana-Champaign); David Zhu (University of Illinois Urbana-Champaign); Yifang Zhang (University of Illinois Urbana-Champaign); Alejandro Montoya Castano (Colombian Directorate of Taxes and Customs (DIAN), Bogotá) |
Abstract: | This study integrates geospatial analysis with machine learning to understand the interplay and spatial dependencies among various indicators of food insecurity. Combining household survey data and novel geospatial data on Syrian refugees in Lebanon, we explore why certain food security measures are effective in specific contexts while others are not. Our findings indicate that geolocational indicators significantly influence food insecurity, often overshadowing traditional factors like household socio-demographics and living conditions. This suggests a shift in focus from labor-intensive socioeconomic surveys to readily accessible geospatial data. The study also highlights the variability of food insecurity across different locations and subpopulations, challenging the effectiveness of individual measures like FCS, HDDS, and rCSI in capturing localized needs. By disaggregating the dimensions of food insecurity and understanding their distribution, humanitarian and development organizations can better tailor strategies, directing resources to areas where refugees face the most severe food challenges. From a policy perspective, our insights call for a refined approach that improves the predictive power of food insecurity models, aiding organizations in efficiently targeting interventions. |
Date: | 2024–09–20 |
URL: | https://d.repec.org/n?u=RePEc:erg:wpaper:1729 |
By: | Guhan Sivakumar |
Abstract: | This paper explores the application of Hidden Markov Models (HMMs) and Long Short-Term Memory (LSTM) neural networks for economic forecasting, focusing on predicting CPI inflation rates. The study proposes a new approach that integrates HMM-derived hidden states and means as additional features for LSTM modeling, aiming to enhance the interpretability and predictive performance of the models. The research begins with data collection and preprocessing, followed by the implementation of the HMM to identify hidden states representing distinct economic conditions. Subsequently, LSTM models are trained on the original and augmented data sets, allowing for comparative analysis and evaluation. The results demonstrate that incorporating HMM-derived data improves the predictive accuracy of LSTM models, particularly in capturing complex temporal patterns and mitigating the impact of volatile economic conditions. Additionally, the paper discusses the implementation of Integrated Gradients for model interpretability and provides insights into the economic dynamics reflected in the forecasting outcomes. |
Date: | 2025–01 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2501.02002 |
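The augmentation step Sivakumar describes, appending HMM-derived hidden states and state means to the series before LSTM training, can be sketched with hmmlearn on a synthetic two-regime inflation series; the settings are illustrative, not the paper's.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

rng = np.random.default_rng(0)
# Synthetic inflation-like series with a low-inflation and a high-inflation regime.
infl = np.concatenate([rng.normal(2, 0.3, 60), rng.normal(6, 0.8, 60)])
X = infl.reshape(-1, 1)

hmm = GaussianHMM(n_components=2, covariance_type="diag", n_iter=200, random_state=0)
hmm.fit(X)
states = hmm.predict(X)                  # inferred regime label per observation
state_means = hmm.means_[states, 0]      # regime-specific mean as an extra feature

# Original series plus the two HMM-derived features, ready for an LSTM stage.
augmented = np.column_stack([infl, states, state_means])
print(augmented[:3])
```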
By: | Wonseong Kim; Christina Niklaus; Choong Lyol Lee; Siegfried Handschuh |
Abstract: | This research integrates Discourse Simplification (DisSim) into aspect-based sentiment analysis (ABSA) to improve aspect selection and sentiment prediction in complex financial texts, particularly central bank communications. The study focuses on decomposing Federal Open Market Committee (FOMC) minutes into simple, canonical structures to identify key sentences encapsulating the core messages of intricate financial narratives. The investigation examines whether hierarchical segmenting of financial texts can enhance ABSA performance using a pre-trained Financial BERT model. Results indicate that DisSim methods enhance aspect selection accuracy in simplified texts compared to untreated counterparts and show empirical improvement in sentiment prediction. The study concludes that decomposing complex financial texts into shorter segments with Discourse Simplification can lead to more precise aspect selection, thereby facilitating more accurate economic analyses. |
Date: | 2025–01 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2501.04959 |
By: | Li, Chao; Keeley, Alexander Ryota; Takeda, Shutaro; Seki, Daikichi; Managi, Shunsuke |
Abstract: | Due to climate change and social issues, environmental, social, and governance (ESG) solutions receive increased attention and emphasis. As influential market leaders, investors wield significant power to persuade companies to prioritize ESG considerations. However, investors' preferences for specific ESG topics and changing trends in those preferences remain elusive. Here, we build a group of large language models with 128 million parameters, named classification pre-trained transformers (CPTs), to extract investors' tendencies toward 13 ESG-related topics from their annual reports. Assisted by the CPT models, which reach approximately 95% cross-validation accuracy, more than 3,000 annual reports released by the 350 top global investors during 2010-2021 are analyzed. Results indicate that although investors show the strongest tendency toward the economic aspect in their annual reports, this emphasis is gradually declining and shifting to environmental and social aspects. Non-financial investors, such as corporations and holding companies, prioritize environmental and social factors, whereas financial investors pay the most attention to governance risk. Patterns also differ at the country level; for instance, Japanese investors show a greater focus on environmental and social factors than those in other major countries. Our findings suggest that investors are increasingly valuing sustainability in their decision-making. Different investor businesses may encounter unique ESG challenges, necessitating individualized strategies. Companies should improve their ESG disclosures, which are increasingly focused on environmental and social issues, to meet investor expectations and bolster transparency. |
Keywords: | ESG; Investor; Natural Language Processing; Pre-trained Transformer; Data Mining; Machine Learning |
JEL: | G11 G18 M1 |
Date: | 2024–11 |
URL: | https://d.repec.org/n?u=RePEc:pra:mprapa:122756 |
By: | Niousha Bagheri; Milad Ghasri; Michael Barlow |
Abstract: | This paper introduces a framework for capturing the stochasticity of choice probabilities in neural networks, derived from and fully consistent with Random Utility Maximization (RUM) theory, referred to as RUM-NN. Neural network models show remarkable performance compared with statistical models; however, they are often criticized for their lack of transparency and interpretability. The proposed RUM-NN is introduced in both linear and nonlinear structures. The linear RUM-NN retains the interpretability and identifiability of traditional econometric discrete choice models while using neural network-based estimation techniques. The nonlinear RUM-NN extends the model's flexibility and predictive capabilities to capture nonlinear relationships between variables within utility functions. Additionally, the RUM-NN allows for the implementation of various parametric distributions for unobserved error components in the utility function and captures correlations among error terms. The performance of RUM-NN in parameter recovery and prediction accuracy is rigorously evaluated on synthetic datasets through Monte Carlo experiments. RUM-NN is also evaluated on the Swissmetro and the London Passenger Mode Choice (LPMC) datasets with different sets of distributional assumptions for the error component. The results demonstrate that RUM-NN under a linear utility structure and IID Gumbel error terms can replicate the performance of the Multinomial Logit (MNL) model, but relaxing those constraints leads to superior performance on both the Swissmetro and LPMC datasets. By introducing a novel estimation approach aligned with statistical theories, this study empowers econometricians to harness the advantages of neural network models. |
Date: | 2025–01 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2501.05221 |
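The linear special case discussed by Bagheri, Ghasri and Barlow, a softmax over linear-in-parameters utilities that reproduces the multinomial logit under IID Gumbel errors, can be sketched in a few lines of PyTorch. Shapes, data, and the training loop are illustrative assumptions, not the RUM-NN implementation.

```python
import torch
import torch.nn as nn

n_obs, n_alt, n_feat = 200, 3, 4
X = torch.randn(n_obs, n_alt, n_feat)            # alternative-specific attributes
beta_true = torch.tensor([1.0, -0.5, 0.8, 0.3])
y = torch.distributions.Categorical(logits=X @ beta_true).sample()   # simulated choices

beta = nn.Parameter(torch.zeros(n_feat))         # utility coefficients to estimate
opt = torch.optim.Adam([beta], lr=0.05)
loss_fn = nn.CrossEntropyLoss()                  # softmax over deterministic utilities

for _ in range(300):
    opt.zero_grad()
    loss = loss_fn(X @ beta, y)                  # logits = linear-in-parameters utilities
    loss.backward()
    opt.step()

print(beta.detach())                             # should approach beta_true, up to noise
```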
By: | Lennart Ante; Aman Saggu |
Abstract: | Following an analysis of existing AI-related exchange-traded funds (ETFs), we reveal that the selection criteria for determining which stocks qualify as AI-related are often opaque, relying on vague phrases and subjective judgments. This paper proposes a new, objective, data-driven approach using natural language processing (NLP) techniques to classify AI stocks by analyzing annual 10-K filings from 3,395 NASDAQ-listed firms between 2011 and 2023. This analysis quantifies each company's engagement with AI through binary indicators and weighted AI scores based on the frequency and context of AI-related terms. Using these metrics, we construct four AI stock indices: the Equally Weighted AI Index (AII), the Size-Weighted AI Index (SAII), and two Time-Discounted AI Indices (TAII05 and TAII5X), each offering a different perspective on AI investment. We validate our methodology through an event study on the launch of OpenAI's ChatGPT, demonstrating that companies with higher AI engagement saw significantly greater positive abnormal returns, with analyses supporting the predictive power of our AI measures. Our indices perform on par with or surpass 14 existing AI-themed ETFs and the Nasdaq Composite Index in risk-return profiles, market responsiveness, and overall performance, achieving higher average daily returns and risk-adjusted metrics without increased volatility. These results suggest that our NLP-based approach offers a reliable, market-responsive, and cost-effective alternative to existing AI-related ETF products. Our methodology can also guide investors, asset managers, and policymakers in using corporate data to construct other thematic portfolios, contributing to a more transparent, data-driven, and competitive approach. |
Date: | 2025–01 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2501.01763 |
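The frequency-based scoring idea in Ante and Saggu can be sketched as a simple term count over a filing's text. The term list and example below are assumptions; the paper's dictionary, context handling, and index construction are considerably richer.

```python
AI_TERMS = ["artificial intelligence", "machine learning", "deep learning",
            "neural network", "natural language processing"]

def ai_score(filing_text: str) -> tuple[int, float]:
    """Return a binary AI indicator and a simple per-word frequency score."""
    text = filing_text.lower()
    n_words = max(len(text.split()), 1)
    hits = sum(text.count(term) for term in AI_TERMS)
    return int(hits > 0), hits / n_words

sample = ("We invest in machine learning and natural language processing "
          "to improve our recommendation systems.")
print(ai_score(sample))    # prints the binary flag and the per-word term frequency
```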
By: | Vogel, Justus; Cordier, Johannes; Filipovic, Miodrag |
Abstract: | Intensive care units (ICUs) operate with fixed capacities and face uncertainty such as demand variability, leading to demand-driven, early discharges to free up beds. These discharges can increase readmission rates, negatively impacting patient outcomes and aggravating ICU bottleneck congestion. This study investigates how ICU discharge timing affects readmission risk, with the goal of developing policies that minimize ICU readmissions while managing demand variability and bed capacity. To define a binary treatment, we randomly assign hypothetical discharge days to patients, comparing these with actual discharge days to form intervention and control groups. We apply two causal machine learning techniques (generalized random forest, modified causal forest). Assuming unconfoundedness, we leverage observed patient data as sufficient covariates. For scenarios where unconfoundedness might fail, we discuss an IV approach with different instruments. We further develop decision policies based on individualized average treatment effects (IATEs) to minimize individual patients' readmission risk. We find that for 72% of our sample (roughly 12,000 cases), discharge at time t rather than t+1 increases their readmission risk. Conversely, 28% of cases benefit from an earlier discharge in terms of readmission risk. To develop decision policies, we rank patients according to their IATE and compare IATE rankings for instances when demand exceeds the available capacity. Finally, we outline how we will assess the potential reduction in readmissions and saved bed capacity under optimal policies in a simulation, offering actionable insights for ICU management. We aim to provide a novel approach and blueprint for similar operations research and management science applications in data-rich environments. |
Keywords: | Causal Machine Learning, Intensive Care Unit Management, Hospital Operations, Policy Learning |
JEL: | I10 C44 |
Date: | 2025 |
URL: | https://d.repec.org/n?u=RePEc:zbw:hsgmed:202501 |
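Estimators of the generalized-random-forest family used by Vogel, Cordier and Filipovic are available, for example, in the econml package. The sketch below estimates individualized effects of a hypothetical later-discharge indicator on a synthetic readmission outcome; it is an illustration of the estimator family, not the study's code or data.

```python
import numpy as np
from econml.dml import CausalForestDML
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 5))                      # patient covariates (synthetic)
T = rng.binomial(1, 0.5, size=n)                 # hypothetical "discharged one day later"
tau = 0.1 * X[:, 0]                              # heterogeneous true treatment effect
p = np.clip(0.2 + tau * T, 0, 1)                 # readmission probability
Y = (rng.uniform(size=n) < p).astype(float)      # readmission outcome

est = CausalForestDML(model_y=RandomForestRegressor(),
                      model_t=RandomForestClassifier(),
                      discrete_treatment=True, random_state=0)
est.fit(Y, T, X=X)
iate = est.effect(X)                             # individualized effect estimates
print(iate[:5])                                  # candidates for ranking discharge decisions
```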
By: | Aromí J. Daniel; Heymann Daniel |
Abstract: | We propose a method to generate “synthetic surveys” that reveal policymakers’ perceptions and narratives. This exercise is implemented using 80 time-stamped Large Language Models (LLMs) fine-tuned with FOMC meetings’ transcripts. Given a text input, fine-tuned models identify highly likely responses for the corresponding FOMC meeting. We demonstrate the value of this tool in three different tasks: measurement of perceived economic conditions, evaluation of transparency in Central Bank communication and extraction of policymaking narratives. Our analysis covers the housing bubble and the subsequent Great Recession (2003-2012). For the first task, LLMs are prompted to generate phrases that describe economic conditions. The resulting outputs show policymakers’ informational advantage. Anticipatory ability increases as models are prompted to discuss future scenarios and financial conditions. To analyze transparency, we compare the content of each FOMC meeting minutes to content generated synthetically through the corresponding fine-tuned LLM. The evaluation suggests the tone of each meeting is transmitted adequately by the corresponding minutes. In the third task, LLMs produce narratives that show policymakers’ views on their responsibilities and their understanding of the main forces shaping macroeconomic dynamics. |
JEL: | E58 E47 |
Date: | 2024–11 |
URL: | https://d.repec.org/n?u=RePEc:aep:anales:4707 |
By: | Leogrande, Angelo; Drago, Carlo; Mallardi, Giulio; Costantiello, Alberto; Magaletti, Nicola |
Abstract: | This article focuses on the propensity to patent across Italian regions, using ISTAT-BES data for 2004-2019 to analyse regional gaps and the determinants of innovative performance. Results show that the North-South gap in innovative performance has persisted over time, confirming the influence of research intensity, digital infrastructure, and cultural employment on patenting activity. These relations are analysed with a panel data econometric model, which singles out crucial positive drivers, such as R&D investment, and strongly negative factors, such as the limited mobility of graduates. The paper makes two main contributions: first, a fine-grained regional differentiation through which the sub-national innovation system can be observed; second, a set of actionable policy recommendations for more substantial and inclusive innovation, with particular emphasis on less-performing regions. By focusing on these dynamics, the study addresses how regional characteristics and policies shape innovation and technological competitiveness in Italy, and thereby contributes to the debate on regional innovation systems and their role in European economic development, given that economic, institutional, and technological conditions differ across Italian regions. |
Keywords: | Innovation, Innovation and Invention, Management of Technological Innovation and R&D, Technological Change, Intellectual Property and Intellectual Capital |
JEL: | O30 O31 O32 O33 O34 O35 O38 |
Date: | 2024–12–23 |
URL: | https://d.repec.org/n?u=RePEc:pra:mprapa:123081 |
By: | Cova, Joshua; Schmitz, Luuk |
Abstract: | The emergence of generative AI models is rapidly changing the social sciences. Much has now been written on the ethics and epistemological considerations of using these tools, and AI-powered research increasingly makes its way to preprint servers. However, we see a gap between ethics and practice: while many researchers would like to use these tools, few if any guides on how to do so exist. This paper fills this gap by providing users with a hands-on application written in accessible language. The paper deals with what we consider the most likely and most advanced use case for AI in the social sciences: text annotation and classification. Our application guides readers through setting up a text classification pipeline and evaluating the results. The most important considerations concern reproducibility and transparency, open-source versus closed-source models, and the difference between classifier and generative models. The take-home message is this: these models provide unprecedented scale to augment research, but the community must take seriously open-source and locally deployable models in the interest of open science principles. Our code to reproduce the example can be accessed via GitHub. |
Date: | 2024–12–20 |
URL: | https://d.repec.org/n?u=RePEc:osf:osfxxx:r3qng |
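In the spirit of the Cova and Schmitz guide, a minimal annotation loop with a small, locally runnable open-source classifier might look as follows. The model choice, texts, and sentiment task are assumptions; pinning the model name in the output is one simple way to support the reproducibility the authors emphasize.

```python
import pandas as pd
from transformers import pipeline

MODEL = "distilbert-base-uncased-finetuned-sst-2-english"   # pinned for replication
clf = pipeline("text-classification", model=MODEL)

df = pd.DataFrame({"text": ["The reform was widely welcomed by unions.",
                            "Households report growing distrust in the programme."]})

out = clf(df["text"].tolist())                 # one dict with label/score per text
df["label"] = [o["label"] for o in out]
df["score"] = [o["score"] for o in out]
df["model"] = MODEL

df.to_csv("annotated.csv", index=False)        # keep the annotated data alongside the model name
print(df)
```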
By: | Galasso, Vincenzo (Bocconi University); Nannicini, Tommaso (European University Institute); Nozza, Debora (Bocconi University) |
Abstract: | Understanding individuals' beliefs, preferences, and motivations is essential in social sciences. Recent technological advancements—notably, large language models (LLMs) for analyzing open-ended responses and the diffusion of voice messaging— have the potential to significantly enhance our ability to elicit these dimensions. This study investigates the differences between oral and written responses to open-ended survey questions. Through a series of randomized controlled trials across three surveys (focused on AI, public policy, and international relations), we assigned respondents to answer either by audio or text. Respondents who provided audio answers gave longer, though lexically simpler, responses compared to those who typed. By leveraging LLMs, we evaluated answer informativeness and found that oral responses differ in both quantity and quality, offering more information and containing more personal experiences than written responses. These findings suggest that oral responses to open-ended questions can capture richer, more personal insights, presenting a valuable method for understanding individual reasoning. |
Keywords: | survey design, open-ended questions, large language models, beliefs |
JEL: | C83 D83 |
Date: | 2024–11 |
URL: | https://d.repec.org/n?u=RePEc:iza:izadps:dp17488 |
By: | Leek, Lauren Caroline (European University Institute); Bischl, Simeon |
Abstract: | Although central bank communication is a core monetary policy and accountability tool, little is known about what shapes it. This paper develops and tests a theory regarding a previously unconsidered variable: central bank independence (CBI). We argue that increases in CBI alter the pressures a central bank faces, compelling it to address these pressures to maintain its reputation. We fine-tune and validate a Large Language Model (Google's Gemini) to develop novel textual indices of the policy pressures discussed in the monetary policy communication of 100 central banks, based on speeches from 1997 to 2023. Employing a staggered difference-in-differences and an instrumental variable approach, we find robust evidence that an increase in independence decreases the monetary pressures and increases the financial pressures discussed in monetary policy communication. These results are not, as is generally assumed, confounded by general changes in communication over time or by singular events, in particular the Global Financial Crisis. |
Date: | 2024–11–25 |
URL: | https://d.repec.org/n?u=RePEc:osf:socarx:yrhka |