nep-big New Economics Papers
on Big Data
Issue of 2024‒10‒14
thirty papers chosen by
Tom Coupé, University of Canterbury


  1. An Anatomy of Firms’ Political Speech By Pablo Ottonello; Wenting Song; Sebastian Sotelo
  2. ECB’s Climate Speeches and Market Reactions. By Antoine Ebeling
  3. COMEX Copper Futures Volatility Forecasting: Econometric Models and Deep Learning By Zian Wang; Xinyi Lu
  4. A Deep Reinforcement Learning Framework For Financial Portfolio Management By Jinyang Li
  5. Forecasting recessions in Germany with machine learning By Rademacher, Philip
  6. Analyzing Regional Disparities in E-Commerce Adoption Among Italian SMEs: Integrating Machine Learning Clustering and Predictive Models with Econometric Analysis By Leogrande, Angelo; Drago, Carlo; Arnone, Massimo
  7. Comparative Study of Long Short-Term Memory (LSTM) and Quantum Long Short-Term Memory (QLSTM): Prediction of Stock Market Movement By Tariq Mahmood; Ibtasam Ahmad; Malik Muhammad Zeeshan Ansar; Jumanah Ahmed Darwish; Rehan Ahmad Khan Sherwani
  8. Evaluating Credit VIX (CDS IV) Prediction Methods with Incremental Batch Learning By Robert Taylor
  9. Improving the Finite Sample Performance of Double/Debiased Machine Learning with Propensity Score Calibration By Daniele Ballinari; Nora Bearth
  10. Research and Design of a Financial Intelligent Risk Control Platform Based on Big Data Analysis and Deep Machine Learning By Shuochen Bi; Yufan Lian; Ziyue Wang
  11. Dynamic Link and Flow Prediction in Bank Transfer Networks By Shu Takahashi; Kento Yamamoto; Shumpei Kobayashi; Ryoma Kondo; Ryohei Hisano
  12. Ethereum Fraud Detection via Joint Transaction Language Model and Graph Representation Learning By Yifan Jia; Yanbin Wang; Jianguo Sun; Yiwei Liu; Zhang Sheng; Ye Tian
  13. Climate Change through the Lens of Macroeconomic Modeling By Jesús Fernández-Villaverde; Kenneth Gillingham; Simon Scheidegger
  14. Machine Learning Methods for Surge Rate Prediction: A Case Study of Yassir By Mukherjee, Krishnendu
  15. StockTime: A Time Series Specialized Large Language Model Architecture for Stock Price Prediction By Shengkun Wang; Taoran Ji; Linhan Wang; Yanshen Sun; Shang-Ching Liu; Amit Kumar; Chang-Tien Lu
  16. Advancing Financial Forecasting: A Comparative Analysis of Neural Forecasting Models N-HiTS and N-BEATS By Mohit Apte; Yashodhara Haribhakta
  17. The effects of data preprocessing on probability of default model fairness By Di Wu
  18. LSR-IGRU: Stock Trend Prediction Based on Long Short-Term Relationships and Improved GRU By Peng Zhu; Yuante Li; Yifan Hu; Qinyuan Liu; Dawei Cheng; Yuqi Liang
  19. Disentangling the sources of cyber risk premia By Loïc Maréchal; Nathan Monnet
  20. Biases in inequality of opportunity estimates: measures and solutions By Moramarco, Domenico; Brunori, Paolo; Salas-Rojo, Pedro
  21. Credit Scores: Performance and Equity By Stefania Albanesi; Domonkos F. Vamossy
  22. Multidimensional poverty in Benin By Esmeralda Arranhado; Lágida Barbosa; João A. Bastos
  23. Identification of an Expanded Inventory of Green Job Titles through AI-Driven Text Mining By Paliński, Michał; Aşık, Gunes A.; Gajderowicz, Tomasz; Jakubowski, Maciej; Nas Özen, Efşan; Raju, Dhushyanth
  24. Targeting and Effectiveness of Location-Based Policies By Carrieri, Vincenzo; de Blasio, G.; Ferrara, Andreas; Nisticò, Rosanna
  25. MoA is All You Need: Building LLM Research Team using Mixture of Agents By Sandy Chen; Leqi Zeng; Abhinav Raghunathan; Flora Huang; Terrence C. Kim
  26. Regime-Switching Factor Models and Nowcasting with Big Data By Omer Faruk Akbal
  27. House Prices, Debt Burdens, and the Heterogeneous Effects of Mortgage Rate Shocks By Gary Cornwall; Marina Gindelsky
  28. Can AI Replace Human Subjects? A Large-Scale Replication of Psychological Experiments with LLMs By Ziyan Cui; Ning Li; Huaikang Zhou
  29. An Experimental Study of Competitive Market Behavior Through LLMs By Jingru Jia; Zehua Yuan
  30. The Mismeasure of Weather: Using Remotely Sensed Earth Observation Data in Economic Context By Anna Josephson; Jeffrey D. Michler; Talip Kilic; Siobhan Murray

  1. By: Pablo Ottonello; Wenting Song; Sebastian Sotelo
    Abstract: We study the distribution of political speech across U.S. firms. We develop a measure of political engagement based on firms’ communications (earnings calls, regulatory filings, and social media), by training a large language model to identify statements that contain political opinions. Using these data, we document five facts about firms’ political engagement. (1) Political engagement is rare among firms. (2) Political engagement is concentrated among large firms. (3) Firms tend to specialize in specific topics and outlets. (4) Large firms tend to engage in a wider set of topics and outlets. (5) The 2020 surge in firms’ political engagement was associated with an increase in the engagement of medium-sized firms and a change in the mix of political topics.
    JEL: C8 D22 D72 G3 L1 M14
    Date: 2024–09
    URL: https://d.repec.org/n?u=RePEc:nbr:nberwo:32923
  2. By: Antoine Ebeling
    Abstract: This paper studies the impact of the European Central Bank’s (ECB) climate-related speeches on European stock markets. Using a database of 2594 ECB speeches delivered between 1997 and 2022, we employ advanced textual analysis techniques, including keyword identification and topic modeling, to isolate speeches related to climate change. We then conduct an event study to estimate the differences in abnormal returns of a large panel of listed companies in response to the ECB’s speeches on climate change. Our analysis reveals that the ECB’s communication on climate issues has intensified significantly since 2015. Using topic modelling methods, we classify climate speeches into two main themes: (i) green finance and economic policies, and (ii) climate-related risks. The event study shows that financial markets tend to reallocate portfolios towards greener ones in the days following the ECB’s climate speeches. Our results show that, following a climate speech by the ECB, green stocks earn positive abnormal returns of around 1 percentage point. More specifically, we find that climate speeches dealing with green monetary policy and other economic policy instruments have a larger effect on green stock prices than speeches dealing with different types of climate risk.
    Keywords: Central bank communication ; Climate change ; Event Study ; Textual Analysis.
    JEL: E52 G14 Q54
    Date: 2024
    URL: https://d.repec.org/n?u=RePEc:ulp:sbbeta:2024-38
  3. By: Zian Wang; Xinyi Lu
    Abstract: This paper investigates the forecasting performance of COMEX copper futures realized volatility across various high-frequency intervals using both econometric volatility models and deep learning recurrent neural network models. The econometric models considered are GARCH and HAR, while the deep learning models include RNN (Recurrent Neural Network), LSTM (Long Short-Term Memory), and GRU (Gated Recurrent Unit). In forecasting daily realized volatility for COMEX copper futures with a rolling window approach, the econometric models, particularly HAR, outperform recurrent neural networks overall, with HAR achieving the lowest QLIKE loss function value. However, when the data is replaced with hourly high-frequency realized volatility, the deep learning models outperform the GARCH model, and HAR attains a comparable QLIKE loss function value. Despite the black-box nature of machine learning models, the deep learning models demonstrate superior forecasting performance, surpassing the fixed QLIKE value of HAR in the experiment. Moreover, as the forecast horizon extends for daily realized volatility, deep learning models gradually close the performance gap with the GARCH model in certain loss function metrics. Nonetheless, HAR remains the most effective model overall for daily realized volatility forecasting in copper futures.
    Date: 2024–09
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2409.08356
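The QLIKE loss used to rank the models above can be sketched as follows. The abstract does not say which variant of QLIKE the authors use, so this sketch assumes the common normalized form, ratio - log(ratio) - 1 with ratio = realized / forecast. The function name `qlike` and the input layout are illustrative, not taken from the paper.

```python
import math


def qlike(realized, forecast):
    """Average QLIKE loss over paired variance observations.

    Assumed form (not confirmed by the abstract):
        loss = rv / h - log(rv / h) - 1
    where rv is realized variance and h is the forecast.
    Both inputs must be sequences of strictly positive numbers.
    """
    if len(realized) != len(forecast) or len(realized) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for rv, h in zip(realized, forecast):
        if rv <= 0 or h <= 0:
            raise ValueError("variances must be strictly positive")
        ratio = rv / h
        total += ratio - math.log(ratio) - 1.0
    return total / len(realized)


# A perfect forecast has zero loss; under-forecasting volatility is
# penalized more heavily than over-forecasting by the same factor.
perfect = qlike([1.0, 2.0], [1.0, 2.0])
under = qlike([1.0], [0.5])
over = qlike([1.0], [2.0])
```

Under this form the loss is asymmetric, which is one reason it is popular for volatility forecasts: a forecast that is too low costs more than one that is too high by the same factor.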
  4. By: Jinyang Li
    Abstract: In this research paper, we examine the paper "A Deep Reinforcement Learning Framework for the Financial Portfolio Management Problem" [arXiv:1706.10059], which addresses a portfolio management problem using deep learning techniques. The original paper proposes a financial-model-free reinforcement learning framework, which consists of the Ensemble of Identical Independent Evaluators (EIIE) topology, a Portfolio-Vector Memory (PVM), an Online Stochastic Batch Learning (OSBL) scheme, and a fully exploiting and explicit reward function. Three different instances are used to realize this framework, namely a Convolutional Neural Network (CNN), a basic Recurrent Neural Network (RNN), and a Long Short-Term Memory (LSTM). The performance is then examined by comparing it with a number of recently reviewed or published portfolio-selection strategies. We have successfully replicated their implementations and evaluations. In addition, we apply this framework to the stock market, instead of the cryptocurrency market that the original paper uses. The experiment in the cryptocurrency market is consistent with the original paper and achieves superior returns, but the framework does not perform as well when applied to the stock market.
    Date: 2024–09
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2409.08426
  5. By: Rademacher, Philip
    Abstract: This paper applies machine learning to forecast business cycles in the German economy using a high-dimensional dataset with 73 indicators, primarily from the OECD Main Economic Indicator Database, covering a time period from 1973 to 2023. Sequential Floating Forward Selection (SFFS) is used to select the most relevant indicators and build compact, explainable, and performant models. Therefore, regularized regression models (LASSO, Ridge) and tree-based classification models (Random Forest, and Logit Boost) are used as challenger models to outperform a probit model containing the term spread as a predictor. All models are trained on data from 1973-2006 and evaluated on a hold-out-sample starting in 2006. The study reveals that fewer indicators are necessary to model recessions. Models built with SFFS have a maximum of eleven indicators. Furthermore, the study setting shows that many indicators are stable across time and business cycles. Machine learning models prove particularly effective in predicting recessions during periods of quantitative easing, when the predictive power of the term spread diminishes. The findings contribute to the ongoing discussion on the use of machine learning in economic forecasting, especially in the context of limited and imbalanced data.
    Keywords: Business Cycles, Recession, Forecasting, Machine Learning
    Date: 2024
    URL: https://d.repec.org/n?u=RePEc:zbw:dicedp:303050
  6. By: Leogrande, Angelo; Drago, Carlo; Arnone, Massimo
    Abstract: The article explores the diffusion of online sales tools among Italian enterprises with at least ten employees, considering regional inequalities through methods that help address economic policy. The study gives an overall assessment of the adoption of e-commerce among Italian SMEs, using multiple methods that help to identify regional disparities and provide insight for policymakers. The data were obtained from the ISTAT-BES database. Analysis was applied using the k-Means machine learning algorithm by comparing the Silhouette coefficient vs. the Elbow method. The elbow method reveals greater expository capacity, and the optimal number of clusters equals 3. The econometric analysis used the following methods: Panel Data with Fixed Effects, Panel Data with Random Effects, Weighted Least Squares-WLS, and Dynamic Panels at 1 Stage. The results show that cultural and creative employment and regular internet users are positively associated with SMEs active in e-commerce while negatively associated with the family's availability of at least one computer and internet connection. Finally, the article compares different machine learning algorithms to predict the future value of SMEs active in e-commerce. The results are discussed critically.
    Keywords: e-Commerce, Small and Medium Enterprises, Regional Inequalities, Panel Data, k-Means, Machine-Learning.
    JEL: O3 O30 O31 O32 O33 O34 O38
    Date: 2024–11
    URL: https://d.repec.org/n?u=RePEc:pra:mprapa:122115
  7. By: Tariq Mahmood; Ibtasam Ahmad; Malik Muhammad Zeeshan Ansar; Jumanah Ahmed Darwish; Rehan Ahmad Khan Sherwani
    Abstract: In recent years, financial analysts have been trying to develop models to predict the movement of a stock price index. The task becomes challenging in vague economic, social, and political situations like in Pakistan. In this study, we employed efficient models of machine learning such as long short-term memory (LSTM) and quantum long short-term memory (QLSTM) to predict the Karachi Stock Exchange (KSE) 100 index by taking monthly data of twenty-six economic, social, political, and administrative indicators from February 2004 to December 2020. A comparison of the LSTM and QLSTM predicted values of the KSE 100 index with the actual values suggests that QLSTM is a potential technique to predict stock market trends.
    Date: 2024–09
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2409.08297
  8. By: Robert Taylor
    Abstract: This paper presents the experimental process and results of SVM, Gradient Boosting, and an Attention-GRU Hybrid model in predicting the Implied Volatility of rolled-over five-year spread contracts of credit default swaps (CDS) on European corporate debt during the quarter following mid-May '24, as represented by the iTraxx/Cboe Europe Main 1-Month Volatility Index (BP Volatility). The analysis employs a feature matrix inspired by Merton's determinants of default probability. Our comparative assessment aims to identify strengths in state-of-the-art (SOTA) and classical machine learning methods for financial risk prediction.
    Date: 2024–08
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2408.15404
  9. By: Daniele Ballinari; Nora Bearth
    Abstract: Machine learning techniques are widely used for estimating causal effects. Double/debiased machine learning (DML) (Chernozhukov et al., 2018) uses a double-robust score function that relies on the prediction of nuisance functions, such as the propensity score, which is the probability of treatment assignment conditional on covariates. Estimators relying on double-robust score functions are highly sensitive to errors in propensity score predictions. Machine learners increase the severity of this problem as they tend to over- or underestimate these probabilities. Several calibration approaches have been proposed to improve probabilistic forecasts of machine learners. This paper investigates the use of probability calibration approaches within the DML framework. Simulation results demonstrate that calibrating propensity scores may significantly reduce the root mean squared error of DML estimates of the average treatment effect in finite samples. We showcase it in an empirical example and provide conditions under which calibration does not alter the asymptotic properties of the DML estimator.
    Date: 2024–09
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2409.04874
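To see why a double-robust score is sensitive to propensity score errors, the sketch below computes the standard augmented inverse probability weighting (AIPW) score for the average treatment effect. It is not the authors' code, and it does not implement any calibration method; the function names and the toy numbers are illustrative assumptions, and the nuisance predictions are passed in as plain lists rather than estimated by cross-fitting.

```python
def aipw_scores(y, d, mu0, mu1, e):
    """Per-observation AIPW scores for the average treatment effect.

    y   : observed outcomes
    d   : treatment indicators (0 or 1)
    mu0 : predicted outcome under control, given covariates
    mu1 : predicted outcome under treatment, given covariates
    e   : predicted propensity scores, strictly between 0 and 1
    """
    scores = []
    for yi, di, m0, m1, ei in zip(y, d, mu0, mu1, e):
        if not 0.0 < ei < 1.0:
            raise ValueError("propensity scores must lie strictly in (0, 1)")
        treated_term = di * (yi - m1) / ei
        control_term = (1 - di) * (yi - m0) / (1.0 - ei)
        scores.append(m1 - m0 + treated_term - control_term)
    return scores


def average(values):
    return sum(values) / len(values)


# Toy data (hypothetical): the first unit is treated and has an
# outcome residual of 1.0 relative to its predicted outcome mu1.
y, d = [3.0, 1.0], [1, 0]
mu0, mu1 = [1.0, 1.0], [2.0, 2.0]

ate_moderate = average(aipw_scores(y, d, mu0, mu1, [0.5, 0.5]))
ate_extreme = average(aipw_scores(y, d, mu0, mu1, [0.05, 0.5]))
```

The only change between the two calls is the first unit's propensity score. Because the outcome residual is divided by that score, a prediction close to zero inflates the unit's contribution and moves the estimate sharply, which is the kind of finite-sample problem that better-calibrated probabilities aim to limit.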
  10. By: Shuochen Bi; Yufan Lian; Ziyue Wang
    Abstract: In the financial field of the United States, the application of big data technology has become one of the important means for financial institutions to enhance competitiveness and reduce risks. The core objective of this article is to explore how to fully utilize big data technology to achieve complete integration of internal and external data of financial institutions, and create an efficient and reliable platform for big data collection, storage, and analysis. With the continuous expansion and innovation of financial business, traditional risk management models are no longer able to meet the increasingly complex market demands. This article adopts big data mining and real-time streaming data processing technology to monitor, analyze, and alert various business data. Through statistical analysis of historical data and precise mining of customer transaction behavior and relationships, potential risks can be more accurately identified and timely responses can be made. This article designs and implements a financial big data intelligent risk control platform. This platform not only achieves effective integration, storage, and analysis of internal and external data of financial institutions, but also intelligently displays customer characteristics and their related relationships and intelligently supervises various risk information.
    Date: 2024–09
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2409.10331
  11. By: Shu Takahashi; Kento Yamamoto; Shumpei Kobayashi; Ryoma Kondo; Ryohei Hisano
    Abstract: The prediction of both the existence and weight of network links at future time points is essential as complex networks evolve over time. Traditional methods, such as vector autoregression and factor models, have been applied to small, dense networks, but become computationally impractical for large-scale, sparse, and complex networks. Some machine learning models address dynamic link prediction, but few address the simultaneous prediction of both link presence and weight. Therefore, we introduce a novel model that dynamically predicts link presence and weight by dividing the task into two sub-tasks: predicting remittance ratios and forecasting the total remittance volume. We use a self-attention mechanism that combines temporal-topological neighborhood features to predict remittance ratios and use a separate model to forecast the total remittance volume. We achieve the final prediction by multiplying the outputs of these models. We validated our approach using two real-world datasets: a cryptocurrency network and a bank transfer network.
    Date: 2024–09
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2409.08718
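The final combination step described in the abstract, multiplying predicted ratios by a forecast of total volume, can be sketched as below. The abstract does not say at which level the total is defined or how link presence is read off the result, so this sketch assumes network-wide shares and treats a link as present when its predicted flow exceeds a hypothetical `min_flow` threshold. The function name, the input layout, and the threshold are all illustrative.

```python
def predict_flows(ratios, total_volume, min_flow=0.0):
    """Combine the two sub-task outputs into per-link flow predictions.

    ratios       : dict mapping (sender, receiver) to the predicted share
                   of total volume (assumed here to be network-wide)
    total_volume : forecast of the total remittance volume
    min_flow     : hypothetical threshold; links at or below it are
                   treated as absent
    """
    flows = {link: share * total_volume for link, share in ratios.items()}
    return {link: flow for link, flow in flows.items() if flow > min_flow}


# Toy example with made-up numbers: three candidate links, one of which
# receives a zero share and is therefore predicted to be absent.
example_ratios = {("a", "b"): 0.75, ("a", "c"): 0.25, ("b", "c"): 0.0}
example_flows = predict_flows(example_ratios, total_volume=200.0)
```

Splitting the task this way lets the ratio model focus on the relative structure of the network while a separate model handles the overall scale.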
  12. By: Yifan Jia; Yanbin Wang; Jianguo Sun; Yiwei Liu; Zhang Sheng; Ye Tian
    Abstract: Ethereum faces growing fraud threats. Current fraud detection methods, whether employing graph neural networks or sequence models, fail to consider the semantic information and similarity patterns within transactions. Moreover, these approaches do not leverage the potential synergistic benefits of combining both types of models. To address these challenges, we propose TLMG4Eth that combines a transaction language model with graph-based methods to capture semantic, similarity, and structural features of transaction data in Ethereum. We first propose a transaction language model that converts numerical transaction data into meaningful transaction sentences, enabling the model to learn explicit transaction semantics. Then, we propose a transaction attribute similarity graph to learn transaction similarity information, enabling us to capture intuitive insights into transaction anomalies. Additionally, we construct an account interaction graph to capture the structural information of the account transaction network. We employ a deep multi-head attention network to fuse transaction semantic and similarity embeddings, and ultimately propose a joint training approach for the multi-head attention network and the account interaction graph to obtain the synergistic benefits of both.
    Date: 2024–09
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2409.07494
  13. By: Jesús Fernández-Villaverde; Kenneth Gillingham; Simon Scheidegger
    Abstract: There is a rapidly advancing literature on the macroeconomics of climate change. This review focuses on developments in the construction and solution of structural integrated assessment models (IAMs), highlighting the marriage of state-of-the-art natural science with general equilibrium theory. We discuss challenges in solving dynamic stochastic IAMs with sharp nonlinearities, multiple regions, and multiple sources of risk. Key innovations in deep learning and other machine learning approaches overcome many computational challenges and enhance the accuracy and relevance of policy findings. We conclude with an overview of recent applications of IAMs and key policy insights.
    JEL: C61 E27 Q5 Q51 Q54 Q58
    Date: 2024–09
    URL: https://d.repec.org/n?u=RePEc:nbr:nberwo:32963
  14. By: Mukherjee, Krishnendu
    Abstract: Transportation Network Companies (TNCs) face two extreme situations, namely, high demand and low demand. In high demand, TNCs use a surge multiplier, or surge rate, to balance the high demand of riders with available drivers. Willingness of drivers, willingness of riders to pay more and an appropriate surge rate play a crucial role in maximizing profits of TNCs. Otherwise, a considerable number of trips can be discarded either by drivers or riders. This paper explains an application of a combined classification and regression model for surge rate prediction. In this paper, twenty-six different machine learning (ML) algorithms are considered for classification and twenty-nine ML algorithms are considered for regression. A total of 55 ML algorithms is considered for surge rate prediction. This paper shows that estimated distance, trip price, acceptance date and time of the trip, finishing time of the trip, starting time of the trip, search radius, base price, wind velocity, humidity, wind pressure, temperature etc. determine whether surge rate or surge multiplier will be applied or not. The price per minute applied for the current trip or minute price, base price, cost of the trip after inflation or deflation (i.e. trip price), the applied radius search for the trip or search radius, humidity, acceptance date of the trip with date and time, barometric pressure, wind velocity, minimum price of the trip, the price per km etc., on the other hand, influence the surge rate. A case study is discussed to implement the proposed algorithm.
    Keywords: Machine Learning, Surge Rate Prediction, Surge Price, Classification, Regression, Random Forest, Light GBM, XGBoost
    JEL: C63 C88 Y10
    Date: 2024–09–19
    URL: https://d.repec.org/n?u=RePEc:pra:mprapa:122151
  15. By: Shengkun Wang; Taoran Ji; Linhan Wang; Yanshen Sun; Shang-Ching Liu; Amit Kumar; Chang-Tien Lu
    Abstract: The stock price prediction task holds a significant role in the financial domain and has been studied for a long time. Recently, large language models (LLMs) have brought new ways to improve these predictions. While recent financial large language models (FinLLMs) have shown considerable progress in financial NLP tasks compared to smaller pre-trained language models (PLMs), challenges persist in stock price forecasting. Firstly, effectively integrating the modalities of time series data and natural language to fully leverage these capabilities remains complex. Secondly, FinLLMs focus more on analysis and interpretability, which can overlook the essential features of time series data. Moreover, due to the abundance of false and redundant information in financial markets, models often produce less accurate predictions when faced with such input data. In this paper, we introduce StockTime, a novel LLM-based architecture designed specifically for stock price data. Unlike recent FinLLMs, StockTime is specifically designed for stock price time series data. It leverages the natural ability of LLMs to predict the next token by treating stock prices as consecutive tokens, extracting textual information such as stock correlations, statistical trends and timestamps directly from these stock prices. StockTime then integrates both textual and time series data into the embedding space. By fusing this multimodal data, StockTime effectively predicts stock prices across arbitrary look-back periods. Our experiments demonstrate that StockTime outperforms recent LLMs, as it gives more accurate predictions while reducing memory usage and runtime costs.
    Date: 2024–08
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2409.08281
  16. By: Mohit Apte; Yashodhara Haribhakta
    Abstract: In the rapidly evolving field of financial forecasting, the application of neural networks presents a compelling advancement over traditional statistical models. This research paper explores the effectiveness of two specific neural forecasting models, N-HiTS and N-BEATS, in predicting financial market trends. Through a systematic comparison with conventional models, this study demonstrates the superior predictive capabilities of neural approaches, particularly in handling the non-linear dynamics and complex patterns inherent in financial time series data. The results indicate that N-HiTS and N-BEATS not only enhance the accuracy of forecasts but also boost the robustness and adaptability of financial predictions, offering substantial advantages in environments that require real-time decision-making. The paper concludes with insights into the practical implications of neural forecasting in financial markets and recommendations for future research directions.
    Date: 2024–08
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2409.00480
  17. By: Di Wu
    Abstract: In the context of financial credit risk evaluation, the fairness of machine learning models has become a critical concern, especially given the potential for biased predictions that disproportionately affect certain demographic groups. This study investigates the impact of data preprocessing, with a specific focus on Truncated Singular Value Decomposition (SVD), on the fairness and performance of probability of default models. Using a comprehensive dataset sourced from Kaggle, various preprocessing techniques, including SVD, were applied to assess their effect on model accuracy, discriminatory power, and fairness.
    Date: 2024–08
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2408.15452
  18. By: Peng Zhu; Yuante Li; Yifan Hu; Qinyuan Liu; Dawei Cheng; Yuqi Liang
    Abstract: Stock price prediction is a challenging problem in the field of finance and receives widespread attention. In recent years, with the rapid development of technologies such as deep learning and graph neural networks, more research methods have begun to focus on exploring the interrelationships between stocks. However, existing methods mostly focus on the short-term dynamic relationships of stocks and directly integrate relationship information with temporal information. They often overlook the complex nonlinear dynamic characteristics and potential higher-order interaction relationships among stocks in the stock market. Therefore, we propose a stock price trend prediction model named LSR-IGRU in this paper, which is based on long short-term stock relationships and an improved GRU input. Firstly, we construct a long short-term relationship matrix between stocks, where secondary industry information is employed for the first time to capture long-term relationships of stocks, and overnight price information is utilized to establish short-term relationships. Next, we improve the inputs of the GRU model at each step, enabling the model to more effectively integrate temporal information and long short-term relationship information, thereby significantly improving the accuracy of predicting stock trend changes. Finally, through extensive experiments on multiple datasets from stock markets in China and the United States, we validate the superiority of the proposed LSR-IGRU model over the current state-of-the-art baseline models. We also apply the proposed model to the algorithmic trading system of a financial company, achieving significantly higher cumulative portfolio returns compared to other baseline methods. Our sources are released at https://github.com/ZP1481616577/Baselines_LSR-IGRU.
    Date: 2024–08
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2409.08282
  19. By: Loïc Maréchal; Nathan Monnet
    Abstract: We use a methodology based on a machine learning algorithm to quantify firms' cyber risks based on their disclosures and a dedicated cyber corpus. The model can identify paragraphs related to determined cyber-threat types and accordingly attribute several related cyber scores to the firm. The cyber scores are unrelated to other firms' characteristics. Stocks with high cyber scores significantly outperform other stocks. The long-short cyber risk factors have positive risk premia, are robust to all factors' benchmarks, and help price returns. Furthermore, we suggest the market does not distinguish between different types of cyber risks but instead views them as a single, aggregate cyber risk.
    Date: 2024–09
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2409.08728
  20. By: Moramarco, Domenico; Brunori, Paolo; Salas-Rojo, Pedro
    Abstract: In this paper we discuss some limitations of using survey data to measure inequality of opportunity. First, we highlight a link between the two fundamental principles of the theory of equal opportunities – compensation and reward – and the concepts of power and confidence levels in hypothesis testing. This connection can be used to address, for example, whether a sample has sufficient observations to appropriately measure inequality of opportunity. Second, we propose a set of tools to normatively assess inequality of opportunity estimates in any type partition. We apply our proposal to Conditional Inference Trees, a machine learning technique that has received growing attention in the literature. Finally, guided by such tools, we suggest that standard tree-based partitions can be manipulated to reduce the risk of compensation and reward principles. Our methodological contribution is complemented with an application using a quasi-administrative sample of Italian PhD graduates. We find a substantial level of labor income inequality among two cohorts of PhD graduates (2012 and 2014), with a significant portion explained by circumstances beyond their control.
    Keywords: equality of opportunity; machine learning; PhD graduates; compensation; reward
    JEL: C30 D31 D63
    Date: 2024–09–01
    URL: https://d.repec.org/n?u=RePEc:ehl:lserod:125442
  21. By: Stefania Albanesi; Domonkos F. Vamossy
    Abstract: Credit scores are critical for allocating consumer debt in the United States, yet little evidence is available on their performance. We benchmark a widely used credit score against a machine learning model of consumer default and find significant misclassification of borrowers, especially those with low scores. Our model improves predictive accuracy for young, low-income, and minority groups due to its superior performance with low quality data, resulting in a gain in standing for these populations. Our findings suggest that improving credit scoring performance could lead to more equitable access to credit.
    JEL: C45 D14 E27 G21 G24 G5 G51
    Date: 2024–09
    URL: https://d.repec.org/n?u=RePEc:nbr:nberwo:32917
  22. By: Esmeralda Arranhado; Lágida Barbosa; João A. Bastos
    Abstract: We examine an individual-level poverty measure for Benin using cross-sectional data. Since our measure is defined within the interval [0, 1], we combine fractional regression models and machine learning models for fractions to examine the factors influencing multidimensional poverty measures and to predict poverty levels. Our approach illustrates the potential of combining parametric models, that inform on the statistical significance and variable interactions, with SHapley Additive exPlanations (SHAP) and Accumulated Local Effects (ALE) plots obtained from a random forest. Results highlight the importance of addressing gender inequalities in education, particularly by increasing access to female education, to effectively reduce poverty. Furthermore, natural conditions arising from agroecological zones are significant determinants of multidimensional poverty, which underscores the need for climate change policies to address poverty in the long term, especially in countries heavily reliant on agriculture. Other significant determinants of welfare include household size, employment sector, and access to financial accounts.
    Keywords: Multidimensional Poverty; Benin; Fractional regression model; Machine learning; SHAP values; ALE plots.
    Date: 2024–09
    URL: https://d.repec.org/n?u=RePEc:ise:remwps:wp03432024
  23. By: Paliński, Michał (University of Warsaw); Aşık, Gunes A. (TOBB University of Economy and Technology); Gajderowicz, Tomasz (University of Warsaw); Jakubowski, Maciej (University of Warsaw); Nas Özen, Efşan (World Bank); Raju, Dhushyanth (World Bank)
    Abstract: This study expands the inventory of green job titles by incorporating a global perspective and using contemporary sources. It leverages natural language processing, specifically a retrieval-augmented generation model, to identify green job titles. The process began with a search of academic literature published after 2008 using the official APIs of Scopus and Web of Science. The search yielded 1,067 articles, from which 695 unique potential green job titles were identified. The retrieval-augmented generation model used the advanced text analysis capabilities of Generative Pre-trained Transformer 4, providing a reproducible method to categorize jobs within various green economy sectors. The research clustered these job titles into 25 distinct sectors. This categorization aligns closely with established frameworks, such as the U.S. Department of Labor's Occupational Information Network, and suggests potential new categories like green human resources. The findings demonstrate the efficacy of advanced natural language processing models in identifying emerging green job roles, contributing significantly to the ongoing discourse on the green economy transition.
    Keywords: AI, text mining, occupational classification, green jobs, green economy
    JEL: J23 Q52 O14
    Date: 2024–09
    URL: https://d.repec.org/n?u=RePEc:iza:izadps:dp17286
  24. By: Carrieri, Vincenzo (University of Calabria); de Blasio, G. (International Monetary Fund); Ferrara, Andreas (University of Warwick); Nisticò, Rosanna (University of Calabria)
    Abstract: This paper provides insights into the design of effective location-based policies. In the context of European regional policy, we use algorithms to predict regions that are likely to underutilize funding and identify the key determinants of their low absorptive capacity. We then use a regression discontinuity design (RDD) to document that EU funds are ineffective in recipients predicted to have low absorptive capacity while increasing output and employment in high-capacity regions. Our approach allows early identification and targeting of interventions to increase regional spending capacity based on publicly available data and standard algorithms, thereby facilitating implementation by policymakers.
    Keywords: program design, location-based policies, machine learning
    JEL: C21 F35 H77 R11
    Date: 2024–09
    URL: https://d.repec.org/n?u=RePEc:iza:izadps:dp17308
  25. By: Sandy Chen; Leqi Zeng; Abhinav Raghunathan; Flora Huang; Terrence C. Kim
    Abstract: Large Language Model (LLM) research in the financial domain is particularly complex due to the sheer number of approaches proposed in the literature. Retrieval-Augmented Generation (RAG) has emerged as one of the leading methods in the sector due to its inherent groundedness and data source variability. In this work, we introduce a RAG framework called Mixture of Agents (MoA) and demonstrate its viability as a practical, customizable, and highly effective approach for scaling RAG applications. MoA is essentially a layered network of individually customized small language models (Hoffmann et al., 2022) collaborating to answer questions and extract information. While there are many theoretical propositions for such an architecture and even a few libraries for generally applying the structure in practice, there are limited documented studies evaluating the potential of this framework considering real business constraints such as cost and speed. We find that the MoA framework, consisting of small language models (Hoffmann et al., 2022), produces higher quality and more grounded responses across various financial domains that are core to Vanguard's business while simultaneously maintaining low costs.
    Date: 2024–09
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2409.07487
  26. By: Omer Faruk Akbal
    Abstract: This paper shows that the Expectation-Maximization (EM) algorithm for regime-switching dynamic factor models provides satisfactory performance relative to other estimation methods and delivers a good trade-off between accuracy and speed, which makes it especially useful for large dimensional data. Unlike traditional numerical maximization approaches, this methodology benefits from closed-form solutions for parameter estimation, enhancing its practicality for real-time applications and historical data exercises with a focus on frequent updates. In a nowcasting application to vintage US data, I study the information content and relative performance of the regime-switching model after each data release over a fifteen-year period, which was only feasible due to the time efficiency of the proposed estimation methodology. While existing literature has already acknowledged the performance improvement of nowcasting models under regime-switching, this paper shows that the superior nowcasting performance is observed particularly when key economic indicators are released. In a backcasting exercise, I show that the model can closely match the recession starting and ending dates of the NBER despite having less information than actual committee meetings, where the fit between actual dates and model estimates becomes more apparent with the additional available information and recession end dates are fully covered with a lag of three to six months. Given that the EM algorithm proposed in this paper is suitable for various regime-switching configurations, this paper provides economists and policymakers with a valuable tool for conducting comprehensive analyses, ranging from point estimates to information decomposition and persistence of recessions in larger datasets.
    Date: 2024–09–06
    URL: https://d.repec.org/n?u=RePEc:imf:imfwpa:2024/190
  27. By: Gary Cornwall; Marina Gindelsky
    Abstract: Inequality statistics are usually calculated from high-quality, comprehensive survey or administrative microdata. Naturally, this data is typically available with a lag of at least 9 months from the reference period. In turbulent times, there is interest in knowing the distributional impacts of observable aggregate business cycle and policy changes sooner. In this paper, we use an elastic net, a generalized model that incorporates lasso and ridge regressions as special cases, to nowcast the overall Gini coefficient and quintile level income shares. National accounts data, published by the Bureau of Economic Analysis, are used (starting in 2000) as features instead of the underlying microdata to produce a series of distributional nowcasts for 2020–2022. These nowcasts predict turning points with at least 85 percent accuracy across all metrics and minimal errors relative to naïve models. We find we could plausibly create advance inequality estimates approximately one month after the end of the calendar year, reducing the present lag by almost a year.
    Keywords: inequality, income distribution, national accounts, nowcasting, machine learning
    JEL: C52 C53 D31 E01
    Date: 2024–09
    URL: https://d.repec.org/n?u=RePEc:gwc:wpaper:2024-003
  28. By: Ziyan Cui; Ning Li; Huaikang Zhou
    Abstract: Artificial Intelligence (AI) is increasingly being integrated into scientific research, particularly in the social sciences, where understanding human behavior is critical. Large Language Models (LLMs) like GPT-4 have shown promise in replicating human-like responses in various psychological experiments. However, the extent to which LLMs can effectively replace human subjects across diverse experimental contexts remains unclear. Here, we conduct a large-scale study replicating 154 psychological experiments from top social science journals with 618 main effects and 138 interaction effects using GPT-4 as a simulated participant. We find that GPT-4 successfully replicates 76.0 percent of main effects and 47.0 percent of interaction effects observed in the original studies, closely mirroring human responses in both direction and significance. However, only 19.44 percent of GPT-4's replicated confidence intervals contain the original effect sizes, with the majority of replicated effect sizes exceeding the 95 percent confidence interval of the original studies. Additionally, there is a 71.6 percent rate of unexpected significant results where the original studies reported null findings, suggesting potential overestimation or false positives. Our results demonstrate the potential of LLMs as powerful tools in psychological research but also emphasize the need for caution in interpreting AI-driven findings. While LLMs can complement human studies, they cannot yet fully replace the nuanced insights provided by human subjects.
    Date: 2024–08
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2409.00128
  29. By: Jingru Jia; Zehua Yuan
    Abstract: This study explores the potential of large language models (LLMs) to conduct market experiments, aiming to understand their capability to comprehend competitive market dynamics. We model the behavior of market agents in a controlled experimental setting, assessing their ability to converge toward competitive equilibria. The results reveal the challenges current LLMs face in replicating the dynamic decision-making processes characteristic of human trading behavior. Unlike humans, LLMs lacked the capacity to achieve market equilibrium. The research demonstrates that while LLMs provide a valuable tool for scalable and reproducible market simulations, their current limitations necessitate further advancements to fully capture the complexities of market behavior. Future work that enhances dynamic learning capabilities and incorporates elements of behavioral economics could improve the effectiveness of LLMs in the economic domain, providing new insights into market dynamics and aiding in the refinement of economic policies.
    Date: 2024–09
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2409.08357
  30. By: Anna Josephson; Jeffrey D. Michler; Talip Kilic; Siobhan Murray
    Abstract: The availability of weather data from remotely sensed Earth observation (EO) data has reduced the cost of including weather variables in econometric models. Weather variables are common instrumental variables used to predict economic outcomes and serve as an input into modelling crop yields for rainfed agriculture. The use of EO data in econometric applications has only recently been met with a critical assessment of the suitability and quality of this data in economics. We quantify the significance and magnitude of the effect of measurement error in EO data in the context of smallholder agricultural productivity. We find that measurement methods differ across EO sources: findings are not robust to the choice of EO dataset and outcomes are not simply affine transformations of one another. This calls for caution on the part of researchers using these data and suggests that robustness checks should include testing alternative sources of EO data.
    Date: 2024–09
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2409.07506

This nep-big issue is ©2024 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at https://nep.repec.org. For comments please write to the director of NEP, Marco Novarese at <director@nep.repec.org>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.