on Big Data |
By: | Pierre Beaucoral |
Abstract: | Analyzing development projects is crucial for understanding donors' aid strategies and recipients' priorities, and for assessing the capacity of development finance to address development issues through on-the-ground actions. In this area, the Organisation for Economic Co-operation and Development's (OECD) Creditor Reporting System (CRS) dataset is a reference data source. This dataset provides a vast collection of project narratives from various sectors (approximately 5 million projects). While the OECD CRS provides a rich source of information on development strategies, it falls short in conveying project purposes because its reporting process relies on donors' self-declared main objectives and pre-defined industrial sectors. This research employs a novel approach that combines Machine Learning (ML) techniques, specifically Natural Language Processing (NLP), with an innovative Python topic modeling technique called BERTopic, to categorise (cluster) and label development projects based on their narrative descriptions. By revealing existing yet hidden topics of development finance, this application of artificial intelligence enables a better understanding of donor priorities and overall development funding and provides methods to analyse the narratives of public and private projects. |
Date: | 2025–02 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2502.09495 |
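The BERTopic pipeline described above labels each cluster using a class-based TF-IDF (c-TF-IDF) weighting of words. The following is a minimal stdlib sketch of that weighting idea, not the library's implementation; the cluster contents are invented:

```python
import math
from collections import Counter

def ctfidf(cluster_docs):
    """Class-based TF-IDF: score words per cluster of documents.

    cluster_docs: dict mapping cluster id -> list of token lists.
    Returns dict mapping cluster id -> {word: score}.
    """
    # Term frequency of each word within each cluster (each cluster is
    # treated as one concatenated pseudo-document).
    tf = {c: Counter(w for doc in docs for w in doc)
          for c, docs in cluster_docs.items()}
    # Average number of words per cluster.
    avg_words = sum(sum(cnt.values()) for cnt in tf.values()) / len(tf)
    # Frequency of each word across all clusters.
    global_freq = Counter()
    for cnt in tf.values():
        global_freq.update(cnt)
    scores = {}
    for c, cnt in tf.items():
        total = sum(cnt.values())
        scores[c] = {w: (n / total) * math.log(1 + avg_words / global_freq[w])
                     for w, n in cnt.items()}
    return scores

# Invented project-narrative tokens, pre-clustered.
clusters = {
    0: [["water", "sanitation", "wells"], ["water", "irrigation"]],
    1: [["school", "teacher"], ["school", "books", "teacher"]],
}
s = ctfidf(clusters)
top0 = max(s[0], key=s[0].get)  # the highest-weighted word labels cluster 0
```

Words frequent within one cluster but rare across clusters score highest, which is what makes the resulting labels topic-specific.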
By: | Mostapha Benhenda (LAGA) |
Abstract: | This paper presents a novel risk-sensitive trading agent combining reinforcement learning and large language models (LLMs). We extend the Conditional Value-at-Risk Proximal Policy Optimization (CPPO) algorithm by adding risk assessment and trading recommendation signals generated by an LLM from financial news. Our approach is backtested on the Nasdaq-100 index benchmark, using financial news data from the FNSPID dataset and the DeepSeek V3, Qwen 2.5, and Llama 3.3 language models. The code, data, and trading agents are available at: https://github.com/benstaf/FinRL_DeepSeek |
Date: | 2025–02 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2502.07393 |
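CPPO extends PPO with a Conditional Value-at-Risk term. A minimal sketch of empirical CVaR, the tail-risk quantity being penalized (the sample returns are invented):

```python
def cvar(returns, alpha=0.05):
    """Empirical Conditional Value-at-Risk: the mean return over the
    worst alpha-fraction of outcomes (the tail-risk measure a
    risk-sensitive agent constrains)."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    worst = sorted(returns)              # ascending: worst returns first
    k = max(1, int(len(worst) * alpha))  # number of tail observations
    tail = worst[:k]
    return sum(tail) / len(tail)

# Ten invented daily returns.
daily = [0.01, -0.02, 0.003, -0.05, 0.02, -0.01, 0.015, -0.03, 0.005, 0.0]
tail_loss = cvar(daily, alpha=0.2)  # mean of the worst 20% of days
```

A CVaR-constrained objective trades off average reward against this tail mean, rather than against variance alone.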
By: | Yuan Liao; Xinjie Ma; Andreas Neuhierl; Linda Schilling |
Abstract: | Machine learning in asset pricing typically predicts expected returns as point estimates, ignoring uncertainty. We develop new methods to construct forecast confidence intervals for expected returns obtained from neural networks. We show that neural network forecasts of expected returns share the same asymptotic distribution as classic nonparametric methods, enabling a closed-form expression for their standard errors. We also propose a computationally feasible bootstrap to obtain the asymptotic distribution. We incorporate these forecast confidence intervals into an uncertainty-averse investment framework. This provides an economic rationale for shrinkage implementations of portfolio selection. Empirically, our methods improve out-of-sample performance. |
Date: | 2025–03 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2503.00549 |
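The paper's interval construction is specific to neural network forecasts; as a generic illustration of the bootstrap idea for a forecast statistic, here is a percentile bootstrap sketch (the sample data are invented, and this is not the paper's procedure):

```python
import random
import statistics

def bootstrap_ci(sample, stat=statistics.mean, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for a statistic:
    resample with replacement, recompute, and take empirical quantiles."""
    rng = random.Random(seed)
    n = len(sample)
    reps = sorted(stat([rng.choice(sample) for _ in range(n)])
                  for _ in range(n_boot))
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = int((1 + level) / 2 * n_boot) - 1
    return reps[lo_idx], reps[hi_idx]

# Invented monthly returns; the interval quantifies uncertainty in the mean.
returns = [0.02, -0.01, 0.03, 0.01, -0.02, 0.04, 0.0, 0.01, -0.03, 0.02]
lo, hi = bootstrap_ci(returns)
```

In the uncertainty-averse framework the abstract describes, wider intervals would shrink the corresponding position toward the benchmark.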
By: | Peiwan Wang; Chenhao Cui; Yong Li |
Abstract: | In recent years, the dominance of machine learning in stock market forecasting has been evident. While these models have shown decreasing prediction errors, their robustness across different datasets has been a concern. A successful stock market prediction model minimizes prediction errors and showcases robustness across various datasets, indicating superior forecasting performance. This study introduces a novel multiple lag order probabilistic model based on trend encoding (TeMoP) that enhances stock market predictions through a probabilistic approach. Results across stock indexes from nine countries demonstrate that TeMoP outperforms state-of-the-art machine learning models in prediction accuracy and stability. |
Date: | 2025–02 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2502.08144 |
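The abstract does not spell out the TeMoP internals, but its core ingredient, trend encoding with conditional probabilities, can be sketched at a single lag order (prices invented; the actual model combines multiple lag orders probabilistically):

```python
from collections import defaultdict

def encode_trends(prices):
    """Encode a price series as +1 (up), -1 (down), 0 (flat) moves."""
    return [(p2 > p1) - (p2 < p1) for p1, p2 in zip(prices, prices[1:])]

def trend_probabilities(trends, lag=2):
    """P(next move is up | last `lag` encoded moves), estimated by counting."""
    counts = defaultdict(lambda: [0, 0])  # context -> [ups, total]
    for i in range(lag, len(trends)):
        ctx = tuple(trends[i - lag:i])
        counts[ctx][1] += 1
        if trends[i] == 1:
            counts[ctx][0] += 1
    return {ctx: ups / total for ctx, (ups, total) in counts.items()}

prices = [10, 11, 12, 11, 12, 13, 12, 13, 14, 13]
trends = encode_trends(prices)
probs = trend_probabilities(trends, lag=2)
```

Encoding in trend space rather than price space is what makes such a model comparable across indexes with very different price levels.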
By: | Yanhao (Max) Wei; Zhenling Jiang
Abstract: | We study an alternative use of machine learning. We train neural nets to provide the parameter estimate of a given (structural) econometric model, for example, discrete choice or consumer search. Training examples consist of datasets generated by the econometric model under a range of parameter values. The neural net takes the moments of a dataset as input and tries to recognize the parameter value underlying that dataset. Besides the point estimate, the neural net can also output statistical accuracy. This neural net estimator (NNE) tends to the limited-information Bayesian posterior as the number of training datasets increases. We apply NNE to a consumer search model. It gives more accurate estimates at lighter computational costs than the prevailing approach. NNE is also robust to redundant moment inputs. In general, NNE offers the most benefits in applications where other estimation approaches require very heavy simulation costs. We provide code at: https://nnehome.github.io. |
Date: | 2025–02 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2502.04945 |
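The simulate-moments-recognize loop behind NNE can be sketched with a toy structural model, substituting a nearest-neighbor lookup for the neural net (all specifics here are invented for illustration):

```python
import random
import statistics

def simulate_dataset(theta, n, rng):
    """Toy 'structural model': draws y ~ Normal(theta, 1)."""
    return [rng.gauss(theta, 1.0) for _ in range(n)]

def moments(data):
    """Summary moments the estimator sees instead of the raw data."""
    return (statistics.mean(data), statistics.pstdev(data))

rng = random.Random(42)
# Training examples: (moments of a simulated dataset, parameter behind it).
training = []
for _ in range(500):
    theta = rng.uniform(-2, 2)
    training.append((moments(simulate_dataset(theta, 200, rng)), theta))

def nne_estimate(data):
    """Stand-in for the neural net: the nearest training moments win."""
    m = moments(data)
    return min(training,
               key=lambda t: (t[0][0] - m[0]) ** 2 + (t[0][1] - m[1]) ** 2)[1]

observed = simulate_dataset(0.5, 200, random.Random(7))
theta_hat = nne_estimate(observed)
```

A trained neural net replaces the lookup with a smooth map from moments to parameters, which is what lets it generalize and report accuracy alongside the point estimate.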
By: | Rath Minati; Date Hema |
Abstract: | The integration of Quantum Deep Learning (QDL) techniques into the landscape of financial risk analysis presents a promising avenue for innovation. This study introduces a framework for credit risk assessment in the banking sector, combining quantum deep learning techniques with adaptive modeling for Row-Type Dependent Predictive Analysis (RTDPA). By leveraging RTDPA, the proposed approach tailors predictive models to different loan categories, aiming to enhance the accuracy and efficiency of credit risk evaluation. While this work explores the potential of integrating quantum methods with classical deep learning for risk assessment, it focuses on the feasibility and performance of this hybrid framework rather than claiming transformative industry-wide impacts. The findings offer insights into how quantum techniques can complement traditional financial analysis, paving the way for further advancements in predictive modeling for credit risk. |
Date: | 2025–02 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2502.07806 |
By: | Yung, Vincent (Northwestern University); Colyvas, Jeannette |
Abstract: | Data wrangling is typically treated as an obligatory, codified, and ideally automated step in the machine learning pipeline. In contrast, we suggest that archival data wrangling is a theory-driven process best understood as a practical craft. Drawing on empirical examples from contemporary computational social science, we identify nine core modes of data wrangling. Although these modes can be seen as a sequence, we emphasize how they are iterative and nonlinear in practice. Moreover, we discuss how data wrangling can address issues of coded bias. Although machine learning emphasizes architectural engineering, we assert that to properly engage with machine learning is to properly engage with data wrangling. |
Date: | 2023–08–18 |
URL: | https://d.repec.org/n?u=RePEc:osf:socarx:2dve6_v1 |
By: | Verhagen, Mark D. |
Abstract: | There is lively discussion regarding the potential and pitfalls of artificial intelligence (AI) and machine learning (ML) for public policy. This debate tends to focus on replacing human decision-making with (semi-)automated processes and the unique challenges such applications pose for policymakers and society more generally. As this paper argues, ML in particular could be used in a more direct and less controversial way: to improve policy analysis and inform evidence-based policymaking. ML methods can identify, in a data-driven way, sub-groups in a population that differ in their policy effect and that might otherwise be missed in standard policy analysis. In doing so, a more complete picture of a policy’s impact on a population can be obtained. I illustrate how ML can complement our understanding of policy interventions by studying the nationwide 2015 decentralisation of the social domain in The Netherlands. This policy intervention delegated responsibilities to administer social care from the national to the municipal level. Using ML methods on entire-population data in The Netherlands, I find the policy induced strongly heterogeneous effects, including evidence of local capture and strong urban/rural divides. These findings are crucial for policymakers assessing whether the policy had the desired outcome. |
Date: | 2023–03–16 |
URL: | https://d.repec.org/n?u=RePEc:osf:socarx:qzm7y_v1 |
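The subgroup-level effect heterogeneity the paper estimates with ML can be illustrated in miniature with difference-in-means effects per subgroup (the data are invented; the paper's methods are far richer):

```python
from collections import defaultdict
from statistics import mean

def subgroup_effects(rows):
    """Difference-in-means treatment effect within each subgroup.

    rows: iterable of (subgroup, treated_flag, outcome) tuples.
    Returns {subgroup: effect} for subgroups with both arms observed.
    """
    buckets = defaultdict(lambda: {True: [], False: []})
    for group, treated, y in rows:
        buckets[group][treated].append(y)
    return {g: mean(b[True]) - mean(b[False])
            for g, b in buckets.items() if b[True] and b[False]}

# Invented outcomes: the policy helps urban units but hurts rural ones.
data = [
    ("urban", True, 5.0), ("urban", True, 6.0),
    ("urban", False, 4.0), ("urban", False, 5.0),
    ("rural", True, 2.0), ("rural", True, 1.0),
    ("rural", False, 4.0), ("rural", False, 3.0),
]
effects = subgroup_effects(data)
```

An average effect over this sample would be close to zero, hiding exactly the urban/rural divide the paper's data-driven subgroup discovery is designed to surface.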
By: | Jaskaran Singh Walia; Aarush Sinha; Srinitish Srinivasan; Srihari Unnikrishnan |
Abstract: | Financial bond yield forecasting is challenging due to data scarcity, nonlinear macroeconomic dependencies, and evolving market conditions. In this paper, we propose a novel framework that leverages Causal Generative Adversarial Networks (CausalGANs) and Soft Actor-Critic (SAC) reinforcement learning (RL) to generate high-fidelity synthetic bond yield data for four major bond categories (AAA, BAA, US10Y, Junk). By incorporating 12 key macroeconomic variables, we ensure statistical fidelity by preserving essential market properties. To transform this market-dependent synthetic data into actionable insights, we employ a fine-tuned Large Language Model (LLM), Qwen2.5-7B, that generates trading signals (BUY/HOLD/SELL), risk assessments, and volatility projections. We use automated, human, and LLM evaluations, all of which demonstrate that our framework improves forecasting performance over existing methods, with statistical validation via predictive accuracy, MAE evaluation (0.103%), profit/loss evaluation (60% profit rate), LLM evaluation (3.37/5), and expert assessments scoring 4.67 out of 5. The reinforcement-learning-enhanced synthetic data generation achieves the lowest Mean Absolute Error of 0.103, demonstrating its effectiveness in replicating real-world bond market dynamics. We not only enhance data-driven trading strategies but also provide a scalable, high-fidelity synthetic financial data pipeline for risk and volatility management and investment decision-making. This work establishes a bridge between synthetic data generation, LLM-driven financial forecasting, and language model evaluation, contributing to AI-driven financial decision-making. |
Date: | 2025–02 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2502.17011 |
By: | Corral Rodas, Paul Andres; Henderson, Heath Linn; Segovia Juarez, Sandra Carolina |
Abstract: | Recent years have witnessed considerable methodological advances in poverty mapping, much of which has focused on the application of modern machine-learning approaches to remotely sensed data. Poverty maps produced with these methods generally share a common validation procedure, which assesses model performance by comparing subnational machine-learning-based poverty estimates with survey-based, direct estimates. Although unbiased, survey-based estimates at a granular level can be imprecise measures of true poverty rates, meaning that it is unclear whether the validation procedures used in machine-learning approaches are informative of actual model performance. This paper examines the credibility of existing approaches to model validation by constructing a pseudo-census from the Mexican Intercensal Survey of 2015, which is used to conduct several design-based simulation experiments. The findings show that the validation procedure often used for machine-learning approaches can be misleading in terms of model assessment since it yields incorrect information for choosing what may be the best set of estimates across different methods and scenarios. Using alternative validation methods, the paper shows that machine-learning-based estimates can rival traditional, more data-intensive poverty mapping approaches. Further, the closest approximation to existing machine-learning approaches, using publicly available geo-referenced data, performs poorly when evaluated against “true” poverty rates and fails to outperform traditional poverty mapping methods in targeting simulations. |
Date: | 2023–05–01 |
URL: | https://d.repec.org/n?u=RePEc:wbk:wbrwps:10429 |
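The paper's core concern, that validating against noisy survey-based direct estimates can rank models incorrectly, can be shown in a deliberately stylized simulation (this construction is illustrative, not the paper's design):

```python
import random
import statistics

def mse(pred, target):
    """Mean squared error between two equal-length lists."""
    return statistics.mean((p - t) ** 2 for p, t in zip(pred, target))

rng = random.Random(0)
truth = [rng.uniform(0.1, 0.6) for _ in range(50)]   # "true" area poverty rates
direct = [t + rng.gauss(0, 0.08) for t in truth]     # noisy survey estimates

model_a = truth[:]    # a model that recovers the truth exactly
model_b = direct[:]   # a model that reproduces the survey noise

# Validation against the noisy direct estimates prefers model_b,
# while validation against the truth prefers model_a.
direct_prefers_b = mse(model_b, direct) < mse(model_a, direct)
truth_prefers_a = mse(model_a, truth) < mse(model_b, truth)
```

A model that chases the noise in the validation target looks best under the common procedure yet is worst against the truth, which is the failure mode the design-based experiments quantify.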
By: | Tianmi Ma; Jiawei Du; Wenxin Huang; Wenjie Wang; Liang Xie; Xian Zhong; Joey Tianyi Zhou |
Abstract: | Recent advancements in large language models (LLMs) have significantly improved performance in natural language processing tasks. However, their ability to generalize to dynamic, unseen tasks, particularly in numerical reasoning, remains a challenge. Existing benchmarks mainly evaluate LLMs on problems with predefined optimal solutions, which may not align with real-world scenarios where clear answers are absent. To bridge this gap, we design the Agent Trading Arena, a virtual numerical game simulating complex economic systems through zero-sum games, where agents invest in stock portfolios. Our experiments reveal that LLMs, including GPT-4o, struggle with algebraic reasoning when dealing with plain-text stock data, often focusing on local details rather than global trends. In contrast, LLMs perform significantly better with geometric reasoning when presented with visual data, such as scatter plots or K-line charts, suggesting that visual representations enhance numerical reasoning. This capability is further improved by incorporating a reflection module, which aids in the analysis and interpretation of complex data. We validate our findings on the NASDAQ stock dataset, where LLMs demonstrate stronger reasoning with visual data compared to text. Our code and data are publicly available at https://github.com/wekjsdvnm/Agent-Trading-Arena.git. |
Date: | 2025–02 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2502.17967 |
By: | Tianzuo Hu |
Abstract: | This paper proposes a financial fraud detection system based on an improved Random Forest (RF) and Gradient Boosting Machine (GBM). Specifically, the system introduces a novel model architecture called GBM-SSRF (Gradient Boosting Machine with Simplified and Strengthened Random Forest), which combines the powerful optimization capabilities of the gradient boosting machine with the computational efficiency and feature extraction capabilities of a Simplified and Strengthened Random Forest (SSRF), significantly improving the performance of financial fraud detection. Although the traditional random forest model has good classification capabilities, it has high computational complexity when faced with large-scale data and certain limitations in feature selection. As a commonly used ensemble learning method, the GBM model has significant advantages in optimizing performance and handling nonlinear problems. However, GBM takes a long time to train and is prone to overfitting when data samples are unbalanced. In response to these limitations, this paper optimizes the structure of the random forest, reducing computational complexity and improving feature selection through structural simplification and strengthening. The optimized random forest is then embedded into the GBM framework, so the model can maintain efficiency and stability with the help of GBM's gradient optimization capability. Experiments show that the GBM-SSRF model not only performs well but also has good robustness and generalization capabilities, providing an efficient and reliable solution for financial fraud detection. |
Date: | 2025–02 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2502.15822 |
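The gradient boosting half of GBM-SSRF can be sketched with plain gradient boosting on depth-1 stumps for squared loss (an illustration of the general technique, not the paper's architecture; the data are invented):

```python
def fit_stump(xs, residuals):
    """Best single threshold split of a 1-D feature, minimizing squared error."""
    best = None
    for thr in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= thr]
        right = [r for x, r in zip(xs, residuals) if x > thr]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, thr, lm, rm)
    _, thr, lm, rm = best
    return lambda x: lm if x <= thr else rm

def gradient_boost(xs, ys, n_rounds=50, lr=0.1):
    """Gradient boosting for squared loss: each stump fits the residuals
    of the current ensemble, then joins it with a small learning rate."""
    base = sum(ys) / len(ys)
    learners = []
    pred = [base] * len(ys)
    for _ in range(n_rounds):
        resid = [y - p for y, p in zip(ys, pred)]
        stump = fit_stump(xs, resid)
        learners.append(stump)
        pred = [p + lr * stump(x) for p, x in zip(pred, xs)]
    return lambda x: base + lr * sum(s(x) for s in learners)

xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0, 0, 0, 0, 1, 1, 1, 1]   # a step function the ensemble must learn
model = gradient_boost(xs, ys)
```

Replacing the stump fitter with a simplified random forest, as the abstract describes, keeps this outer loop intact while changing the base learner.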
By: | Hamid Moradi-Kamali; Mohammad-Hossein Rajabi-Ghozlou; Mahdi Ghazavi; Ali Soltani; Amirreza Sattarzadeh; Reza Entezari-Maleki |
Abstract: | Financial Sentiment Analysis (FSA) traditionally relies on human-annotated sentiment labels to infer investor sentiment and forecast market movements. However, inferring the potential market impact of words based on their human-perceived intentions is inherently challenging. We hypothesize that the historical market reactions to words offer a more reliable indicator of their potential impact on markets than subjective sentiment interpretations by human annotators. To test this hypothesis, a market-derived labeling approach is proposed to assign tweet labels based on ensuing short-term price trends, enabling the language model to capture the relationship between textual signals and market dynamics directly. A domain-specific language model was fine-tuned on these labels, achieving up to an 11% improvement in short-term trend prediction accuracy over traditional sentiment-based benchmarks. Moreover, by incorporating market and temporal context through prompt-tuning, the proposed context-aware language model demonstrated an accuracy of 89.6% on a curated dataset of 227 impactful Bitcoin-related news events with significant market impacts. Aggregating daily tweet predictions into trading signals, our method outperformed traditional fusion models (which combine sentiment-based and price-based predictions), challenging the assumption that sentiment-based signals are inferior to price-based predictions in forecasting market movements. Backtesting these signals across three distinct market regimes yielded robust Sharpe ratios of up to 5.07 in trending markets and 3.73 in neutral markets. Our findings demonstrate that language models can serve as effective short-term market predictors. This paradigm shift underscores the untapped capabilities of language models in financial decision-making and opens new avenues for market prediction applications. |
Date: | 2025–02 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2502.14897 |
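The market-derived labeling idea, assigning a tweet's label from the ensuing short-term price trend rather than human-read sentiment, can be sketched as follows (the horizon, threshold, and prices are illustrative choices, not the paper's):

```python
def market_label(tweet_time, prices, horizon=3, threshold=0.002):
    """Label a tweet by the ensuing short-term price trend instead of
    human-perceived sentiment.

    prices: dict mapping integer time step -> price.
    Returns "bullish", "bearish", "neutral", or None if prices are missing.
    """
    p0, p1 = prices.get(tweet_time), prices.get(tweet_time + horizon)
    if p0 is None or p1 is None:
        return None
    change = (p1 - p0) / p0
    if change > threshold:
        return "bullish"
    if change < -threshold:
        return "bearish"
    return "neutral"

# Invented price path; tweets posted at t=0 and t=4.
prices = {0: 100.0, 1: 100.5, 2: 101.2, 3: 102.0,
          4: 101.0, 5: 99.0, 6: 98.5, 7: 99.9}
labels = [market_label(t, prices) for t in (0, 4)]
```

Fine-tuning a language model on such labels ties its supervision signal directly to realized market reactions rather than to annotators' readings of intent.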
By: | Merfeld, Joshua David; Newhouse, David Locke |
Abstract: | Reliable estimates of economic welfare for small areas are valuable inputs into the design and evaluation of development policies. This paper compares the accuracy of point estimates and confidence intervals for small area estimates of wealth and poverty derived from four different prediction methods: linear mixed models, Cubist regression, extreme gradient boosting, and boosted regression forests. The evaluation draws samples from unit-level household census data from four developing countries, combines them with publicly and globally available geospatial indicators to generate small area estimates, and evaluates these estimates against aggregates calculated using the full census. Predictions of wealth are evaluated in four countries and poverty in one. All three machine learning methods outperform the traditional linear mixed model, with extreme gradient boosting and boosted regression forests generally outperforming the other alternatives. The proposed residual bootstrap procedure reliably estimates confidence intervals for the machine learning estimators, with estimated coverage rates across simulations falling between 94 and 97 percent. These results demonstrate that predictions obtained using tree-based gradient boosting with a random effect block bootstrap generate more accurate point and uncertainty estimates than prevailing methods for generating small area welfare estimates. |
Date: | 2023–03–08 |
URL: | https://d.repec.org/n?u=RePEc:wbk:wbrwps:10348 |
By: | Eric Hitz; Mingmin Feng; Radu Tanase; René Algesheimer; Manuel S. Mariani
Abstract: | Recent advances in artificial intelligence have led to the proliferation of artificial agents in social contexts, ranging from education to online social media and financial markets, among many others. The increasing rate at which artificial and human agents interact makes it urgent to understand the consequences of human-machine interactions for the propagation of new ideas, products, and behaviors in society. Across two distinct empirical contexts, we find here that artificial agents lead to significantly faster and wider social contagion. To this end, we replicate a choice experiment previously conducted with human subjects by using artificial agents powered by large language models (LLMs). We use the experiment's results to measure the adoption thresholds of artificial agents and their impact on the spread of social contagion. We find that artificial agents tend to exhibit lower adoption thresholds than humans, which leads to wider network-based social contagions. Our findings suggest that the increased presence of artificial agents in real-world networks may accelerate behavioral shifts, potentially in unforeseen ways. |
Date: | 2025–02 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2502.21037 |
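The adoption-threshold mechanism can be sketched with a Watts-style threshold contagion model: lowering thresholds, as the paper measures for LLM agents relative to humans, widens the cascade (the network and threshold values are invented):

```python
def spread(adjacency, seeds, thresholds):
    """Threshold contagion: a node adopts once the adopting fraction of
    its neighbors reaches its threshold. Returns the final adopter set."""
    adopted = set(seeds)
    changed = True
    while changed:
        changed = False
        for node, neighbors in adjacency.items():
            if node in adopted or not neighbors:
                continue
            frac = sum(n in adopted for n in neighbors) / len(neighbors)
            if frac >= thresholds[node]:
                adopted.add(node)
                changed = True
    return adopted

# A small ring network: each node is linked to its two neighbors.
ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
# Hypothetical thresholds: 0.6 for "human" nodes, 0.4 for "agent" nodes.
human = spread(ring, seeds={0}, thresholds={i: 0.6 for i in range(6)})
agent = spread(ring, seeds={0}, thresholds={i: 0.4 for i in range(6)})
```

With one adopting neighbor out of two, a 0.6 threshold blocks the cascade entirely while a 0.4 threshold lets it sweep the whole ring, mirroring the wider contagion the paper attributes to artificial agents.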
By: | Songrun He; Linying Lv; Asaf Manela; Jimmy Wu |
Abstract: | Large language models are increasingly used in social sciences, but their training data can introduce lookahead bias and training leakage. A good chronologically consistent language model requires efficient use of training data to maintain accuracy despite time-restricted data. Here, we overcome this challenge by training chronologically consistent large language models timestamped with the availability date of their training data, yet accurate enough that their performance is comparable to state-of-the-art open-weight models. Lookahead bias is model- and application-specific because even if a chronologically consistent language model has poorer language comprehension, a regression or prediction model applied on top of the language model can compensate. In an asset pricing application, we compare the performance of news-based portfolio strategies that rely on chronologically consistent versus biased language models and estimate a modest lookahead bias. |
Date: | 2025–02 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2502.21206 |
By: | Kosuke Imai; Michael Lingzhi Li |
Abstract: | We analyze the split-sample robust inference (SSRI) methodology proposed by Chernozhukov, Demirer, Duflo, and Fernandez-Val (CDDF) for quantifying uncertainty in heterogeneous treatment effect estimation. While SSRI effectively accounts for randomness in data splitting, its computational cost can be prohibitive when combined with complex machine learning (ML) models. We present an alternative randomization inference (RI) approach that maintains SSRI's generality without requiring repeated data splitting. By leveraging cross-fitting and design-based inference, RI achieves valid confidence intervals while significantly reducing computational burden. We compare the two methods through simulation, demonstrating that RI retains statistical efficiency while being more practical for large-scale applications. |
Date: | 2025–02 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2502.06758 |
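The randomization inference idea of re-randomizing treatment assignments can be sketched with a simple permutation test of a difference in means (illustrative of RI in general, not the paper's design-based cross-fitting procedure; the data are invented):

```python
import random
from statistics import mean

def permutation_pvalue(treated, control, n_perm=5000, seed=0):
    """Two-sided randomization p-value for a difference in means:
    reshuffle treatment labels and compare against the observed effect."""
    rng = random.Random(seed)
    pooled = treated + control
    observed = abs(mean(treated) - mean(control))
    n_t = len(treated)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if abs(mean(pooled[:n_t]) - mean(pooled[n_t:])) >= observed:
            hits += 1
    # Add-one correction keeps the p-value strictly positive.
    return (hits + 1) / (n_perm + 1)

treated = [3.1, 2.8, 3.5, 3.0, 3.3, 2.9]
control = [2.1, 2.0, 2.4, 1.9, 2.2, 2.3]
p = permutation_pvalue(treated, control)
```

Because the reference distribution is built from the randomization itself, the test's validity does not depend on repeated data splitting, which is the computational advantage the paper develops.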
By: | Lett, Elle; La Cava, William |
Abstract: | Machine learning (ML)-derived tools are rapidly being deployed as an additional input in the clinical decision-making process to optimize health interventions. However, ML models also risk propagating societal discrimination and exacerbating existing health inequities. The field of ML fairness has focused on developing approaches to mitigate bias in ML models. To date, the focus has been on the model fitting process, simplifying the processes of structural discrimination to definitions of model bias based on performance metrics. Here, we reframe the ML task through the lens of intersectionality, a Black feminist theoretical framework that contextualizes individuals in interacting systems of power and oppression, linking inquiry into measuring fairness to the pursuit of health justice. In doing so, we present intersectional ML fairness as a paradigm shift that moves from an emphasis on model metrics to an approach for ML that is centered around achieving more equitable health outcomes. |
Date: | 2023–02–27 |
URL: | https://d.repec.org/n?u=RePEc:osf:socarx:gu7yh_v1 |
By: | Daniel Brunstein (LISA - Laboratoire « Lieux, Identités, eSpaces, Activités » (UMR CNRS 6240 LISA) - CNRS - Centre National de la Recherche Scientifique - Università di Corsica Pasquale Paoli [Université de Corse Pascal Paoli]); Georges Casamatta (LISA - Laboratoire « Lieux, Identités, eSpaces, Activités » (UMR CNRS 6240 LISA) - CNRS - Centre National de la Recherche Scientifique - Università di Corsica Pasquale Paoli [Université de Corse Pascal Paoli]); Sauveur Giannoni (Università di Corsica Pasquale Paoli [Université de Corse Pascal Paoli], LISA - Laboratoire « Lieux, Identités, eSpaces, Activités » (UMR CNRS 6240 LISA) - CNRS - Centre National de la Recherche Scientifique - Università di Corsica Pasquale Paoli [Université de Corse Pascal Paoli]) |
Abstract: | This study investigates the influence of Airbnb on property prices in Corsica. Leveraging machine learning techniques, we obtain more robust results than those achieved with conventional methods and uncover heterogeneous effects of Airbnb on property values. Our analysis reveals that a 1% increase in Airbnb listings leads to an average 0.21% rise in house prices. Interestingly, this effect is more pronounced in economically less developed regions, such as inland municipalities and remote seaside resorts, compared to traditionally popular tourist destinations and urban areas. |
Keywords: | Short-term rental, Housing market, Machine learning, Heterogeneous effects |
Date: | 2025 |
URL: | https://d.repec.org/n?u=RePEc:hal:journl:hal-04934630 |
By: | Francesco Puoti; Fabrizio Pittorino; Manuel Roveri |
Abstract: | This paper offers a thorough examination of the univariate predictability of cryptocurrency time-series. By exploiting a combination of complexity measures and model predictions, we explore the cryptocurrency time-series forecasting task, focusing on the exchange rate in USD of Litecoin, Binance Coin, Bitcoin, Ethereum, and XRP. On one hand, to assess the complexity and the randomness of these time-series, a comparative analysis has been performed using Brownian and colored noises as a benchmark. The results obtained from the Complexity-Entropy causality plane and power density spectrum analysis reveal that cryptocurrency time-series exhibit characteristics closely resembling those of Brownian noise when analyzed in a univariate context. On the other hand, the application of a wide range of statistical, machine, and deep learning models for time-series forecasting demonstrates the low predictability of cryptocurrencies. Notably, our analysis reveals that simpler models such as Naive models consistently outperform the more complex machine and deep learning ones in terms of forecasting accuracy across different forecast horizons and time windows. The combined study of complexity and forecasting accuracy highlights the difficulty of predicting the cryptocurrency market. These findings provide valuable insights into the inherent characteristics of cryptocurrency data and highlight the need to reassess the challenges associated with predicting cryptocurrency price movements. |
Date: | 2025–02 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2502.09079 |
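The Complexity-Entropy causality plane builds on the Bandt-Pompe permutation entropy of ordinal patterns; here is a stdlib sketch of the normalized entropy component (the test series are invented):

```python
import math
import random
from collections import Counter

def permutation_entropy(series, order=3):
    """Normalized Bandt-Pompe permutation entropy of ordinal patterns.
    Returns a value in [0, 1]: near 1 for noise-like data, lower for
    regular data."""
    patterns = Counter()
    for i in range(len(series) - order + 1):
        window = series[i:i + order]
        # Ordinal pattern: the argsort of the values in the window.
        pattern = tuple(sorted(range(order), key=window.__getitem__))
        patterns[pattern] += 1
    total = sum(patterns.values())
    h = -sum((c / total) * math.log(c / total) for c in patterns.values())
    return h / math.log(math.factorial(order))

monotone = list(range(100))                     # perfectly regular series
rng = random.Random(1)
noisy = [rng.random() for _ in range(1000)]     # white-noise-like series
h_reg = permutation_entropy(monotone)
h_noise = permutation_entropy(noisy)
```

A price series whose entropy sits near the noise benchmarks, as the paper finds for cryptocurrencies against Brownian noise, leaves little ordinal structure for a forecaster to exploit.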
By: | Morande, Swapnil; Arshi, Tahseen; Gul, Kanwal; Amini, Mitra |
Abstract: | This pioneering study employs machine learning to predict startup success, addressing the long-standing challenge of deciphering entrepreneurial outcomes amidst uncertainty. Integrating the multidimensional SECURE framework for holistic opportunity evaluation with AI's pattern recognition prowess, the research puts forth a novel analytics-enabled approach to illuminate success determinants. Rigorously constructed predictive models demonstrate remarkable accuracy in forecasting success likelihood, validated through comprehensive statistical analysis. The findings reveal AI’s immense potential in bringing evidence-based objectivity to the complex process of opportunity assessment. On the theoretical front, the research enriches entrepreneurship literature by bridging the knowledge gap at the intersection of structured evaluation tools and data science. On the practical front, it empowers entrepreneurs with an analytical compass for decision-making and helps investors make prudent funding choices. The study also informs policymakers to optimize conditions for entrepreneurship. Overall, it lays the foundation for a new frontier of AI-enabled, data-driven entrepreneurship research and practice. However, acknowledging AI’s limitations, the synthesis underscores the persistent relevance of human creativity alongside data-backed insights. With high predictive performance and multifaceted implications, the SECURE-AI model represents a significant stride toward an analytics-empowered paradigm in entrepreneurship management. |
Date: | 2023–08–29 |
URL: | https://d.repec.org/n?u=RePEc:osf:socarx:p3gyb_v1 |
By: | Tanay Panat; Rohitash Chandra |
Abstract: | The drastic changes in the global economy, geopolitical conditions, and disruptions such as the COVID-19 pandemic have impacted the cost of living and quality of life. It is important to understand the long-term nature of the cost of living and quality of life in major economies. A transparent and comprehensive living index must include multiple dimensions of living conditions. In this study, we present an approach to quantifying the quality of life through the Global Ease of Living Index that combines various socio-economic and infrastructural factors into a single composite score. Our index utilises economic indicators that define living standards, which could help in targeted interventions to improve specific areas. We present a machine learning framework for addressing the problem of missing data for some of the economic indicators for specific countries. We then curate and update the data and use a dimensionality reduction approach (principal component analysis) to create the Ease of Living Index for major economies since 1970. Our work significantly adds to the literature by offering a practical tool for policymakers to identify areas needing improvement, such as healthcare systems, employment opportunities, and public safety. Our approach with open data and code can be easily reproduced and applied to various contexts. This transparency and accessibility make our work a valuable resource for ongoing research and policy development in quality-of-life assessment. |
Date: | 2025–02 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2502.06866 |
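The abstract describes combining normalized socio-economic indicators into one composite score via principal component analysis; a simpler equal-weight min-max composite conveys the aggregation step (indicator values are invented, and PCA-derived weights would replace the equal weights used here):

```python
def composite_index(indicators, weights=None):
    """Min-max normalize each indicator across countries, then combine
    them into one score in [0, 1] with the given (or equal) weights."""
    names = list(indicators)
    weights = weights or {n: 1 / len(names) for n in names}
    countries = list(next(iter(indicators.values())))
    normed = {}
    for n in names:
        vals = indicators[n]
        lo, hi = min(vals.values()), max(vals.values())
        normed[n] = {c: (v - lo) / (hi - lo) for c, v in vals.items()}
    return {c: sum(weights[n] * normed[n][c] for n in names)
            for c in countries}

# Hypothetical indicator values for three countries A, B, C.
indicators = {
    "health": {"A": 70.0, "B": 80.0, "C": 60.0},
    "safety": {"A": 0.9, "B": 0.5, "C": 0.7},
}
scores = composite_index(indicators)
```

Normalizing first puts indicators with different units on a common scale, so no single dimension dominates the composite by accident of measurement.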
By: | Iman Modarressi; Jann Spiess; Amar Venugopal |
Abstract: | We propose a machine-learning tool that yields causal inference on text in randomized trials. Based on a simple econometric framework in which text may capture outcomes of interest, our procedure addresses three questions: First, is the text affected by the treatment? Second, on which outcomes does the effect operate? And third, how complete is our description of causal effects? To answer all three questions, our approach uses large language models (LLMs) that suggest systematic differences across two groups of text documents and then provides valid inference based on costly validation. Specifically, we highlight the need for sample splitting to allow for statistical validation of LLM outputs, as well as the need for human labeling to validate substantive claims about how documents differ across groups. We illustrate the tool in a proof-of-concept application using abstracts of academic manuscripts. |
Date: | 2025–03 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2503.00725 |
By: | Xu, Wenfei |
Abstract: | Mid-20th-century urban renewal in the United States was transformational for the physical urban fabric and for the socioeconomic trajectories of affected neighborhoods and their displaced residents. However, there is little research that systematically investigates its impacts due to incomplete national data. This article uses multiple machine learning methods to discover 204 new Census tracts that were likely sites of federal urban renewal, highway-construction-related demolition, and other urban renewal projects between 1949 and 1970. It also aims to understand the factors motivating the decision to “renew” certain neighborhoods. I find that race, housing age, and homeownership are all determinants of renewal. Moreover, by stratifying the analysis along neighborhoods perceived to be more or less risky, I also find that race and housing age are two distinct channels that influence renewal. |
Date: | 2023–05–13 |
URL: | https://d.repec.org/n?u=RePEc:osf:socarx:bsvr8_v1 |
By: | Andree, Bo Pieter Johannes; Pape, Utz Johann |
Abstract: | Capabilities to track fast-moving economic developments remain limited in many regions of the developing world. This complicates prioritizing policies aimed at supporting vulnerable populations. To gain insight into the evolution of fluid events in a data-scarce context, this paper explores the ability of recent machine-learning advances to produce continuous data in near-real-time by imputing multiple entries in ongoing surveys. The paper attempts to track inflation in fresh produce prices at the local market level in Papua New Guinea, relying only on incomplete and intermittent survey data. This application is made challenging by high intra-month price volatility, low cross-market price correlations, and weak price trends. The modeling approach uses chained equations to produce an ensemble prediction for multiple price quotes simultaneously. The paper runs cross-validation of the prediction strategy under different designs in terms of markets, foods, and time periods covered. The results show that when the survey is well-designed, imputations can achieve accuracy that is attractive when compared to costly (and often logistically infeasible) direct measurement. The methods have wider applicability and could help to fill crucial data gaps in data-scarce regions such as the Pacific Islands, especially in conjunction with specifically designed continuous surveys. |
Date: | 2023–09–05 |
URL: | https://d.repec.org/n?u=RePEc:wbk:wbrwps:10559 |
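The chained-equations idea in the abstract above can be illustrated with a toy numpy-only sketch: initialize missing entries with column means, then cycle through incomplete columns, regressing each on the others and refreshing its missing values with the fitted predictions. This is a deliberately simplified stand-in for the paper's ensemble imputation; the function name and example data are illustrative only:

```python
import numpy as np

def chained_impute(X, n_iter=10):
    """Minimal chained-equations imputation: mean-initialize missing
    cells, then repeatedly regress each incomplete column on the
    others (OLS) and overwrite its missing cells with predictions."""
    X = np.asarray(X, dtype=float).copy()
    miss = np.isnan(X)
    col_means = np.nanmean(X, axis=0)
    for j in range(X.shape[1]):
        X[miss[:, j], j] = col_means[j]          # crude starting values
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            if not miss[:, j].any():
                continue
            others = np.delete(X, j, axis=1)
            A = np.column_stack([np.ones(len(X)), others])
            obs = ~miss[:, j]
            beta, *_ = np.linalg.lstsq(A[obs], X[obs, j], rcond=None)
            X[miss[:, j], j] = A[miss[:, j]] @ beta
    return X

# Toy panel of market price quotes with one missing entry;
# column 2 is roughly twice column 1, so the imputation should recover ~6
prices = np.array([[1.0, 2.0],
                   [2.0, 4.1],
                   [3.0, np.nan],
                   [4.0, 8.0]])
filled = chained_impute(prices)
```

The paper's actual procedure combines multiple chained models into an ensemble prediction; the sketch shows only the core iteration.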
By: | Hui Chen; Giovanni Gambarotta; Simon Scheidegger; Yu Xu |
Abstract: | We build a state-of-the-art dynamic model of private asset allocation that considers five key features of private asset markets: (1) the illiquid nature of private assets, (2) timing lags between capital commitments, capital calls, and eventual distributions, (3) time-varying business cycle conditions, (4) serial correlation in observed private asset returns, and (5) regulatory constraints on certain institutional investors' portfolio choices. We use cutting-edge machine learning methods to quantify the optimal investment policies over the life cycle of a fund. Moreover, our model offers regulators a tool for precisely quantifying the trade-offs when setting risk-based capital charges. |
Date: | 2025–03 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2503.01099 |
By: | Goldemberg, Diana; Jordan, Luke Simon; Kenyon, Thomas |
Abstract: | This paper applies novel techniques to long-standing questions of aid effectiveness. It first replicates findings that donor finance is discernibly but weakly associated with sector outcomes in recipient countries. It then shows robustly that donors' own ratings of project success provide limited information on the contribution of those projects to development outcomes. By training a machine learning model on World Bank projects, the paper shows instead that the strongest predictor of these projects’ contribution to outcomes is their degree of adaptation to country context, and the largest differences between ratings and actual impact occur in large projects in institutionally weak settings. |
Date: | 2023–07–31 |
URL: | https://d.repec.org/n?u=RePEc:wbk:wbrwps:10532 |
By: | Leonardo Berti; Gjergji Kasneci |
Abstract: | Stock Price Trend Prediction (SPTP) based on Limit Order Book (LOB) data is a fundamental challenge in financial markets. Despite advances in deep learning, existing models fail to generalize across different market conditions and struggle to reliably predict short-term trends. Surprisingly, by adapting a simple MLP-based architecture to LOB data, we show that it surpasses state-of-the-art (SoTA) performance, challenging the necessity of complex architectures. Because past work reveals robustness issues with such models, we propose TLOB, a transformer-based model that uses a dual attention mechanism to capture spatial and temporal dependencies in LOB data. This allows it to adaptively focus on the market microstructure, making it particularly effective for longer-horizon predictions and volatile market conditions. We also introduce a new labeling method that improves on previous ones by removing the horizon bias. We evaluate TLOB's effectiveness on the established FI-2010 benchmark, where it exceeds the state-of-the-art by an average of 3.7 F1-score points (%). TLOB also shows improvements on Tesla and Intel, with increases of 1.3 and 7.7 F1-score points (%), respectively. We further show empirically that stock price predictability has declined over time (-6.68 absolute F1-score points (%)), highlighting growing market efficiency. Predictability must be considered in relation to transaction costs, so we experimented with defining trends using an average spread, reflecting the primary transaction cost. The resulting performance deterioration underscores the complexity of translating trend classification into profitable trading strategies. We argue that our work provides new insights into the evolving landscape of stock price trend prediction and sets a strong foundation for future advancements in financial AI. We release the code at https://github.com/LeonardoBerti00/TLOB. |
Date: | 2025–02 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2502.15757 |
By: | Mahony, Christopher Brian; Manning, Matthew; Wong, Gabriel |
Abstract: | Can the impact of justice processes be enhanced with the inclusion of a heterogeneous component into an existing cost-benefit analysis app that demonstrates how benefactors and beneficiaries are affected? Such a component requires (i) moving beyond the traditional cost-benefit conceptual framework of utilizing averages, (ii) identification of social group or population-specific variation, (iii) identification of how justice processes differ across groups/populations, (iv) distribution of costs and benefits according to the identified variations, and (v) utilization of empirically informed statistical techniques to gain new insights from data and maximize the impact for beneficiaries. This paper outlines a method for capturing heterogeneity. The paper tests the method and the cost-benefit analysis online app that was developed using primary data collected from a developmental crime prevention intervention in Australia. The paper identifies how subgroups in the intervention display different behavioral adjustments across the reference period, revealing the heterogeneous distribution of costs and benefits. Finally, the paper discusses the next version of the cost-benefit analysis app, which incorporates an artificial intelligence-driven component that reintegrates individual cost-benefit analysis projects using machine learning and other modern data science techniques. The paper argues that the app enhances cost-benefit analysis, development outcomes, and policy making efficiency for optimal prioritization of criminal justice resources. Further, the app advances the policy accessibility of enhanced, social group-specific data, illuminating optimal policy orientation for more inclusive, just, and resilient societal outcomes—an approach with potential across broader public policy. |
Date: | 2023–05–18 |
URL: | https://d.repec.org/n?u=RePEc:wbk:wbrwps:10449 |
By: | Pecorari, Natalia Gisel; Cuesta Leiva, Jose Antonio |
Abstract: | This paper advances the understanding of the linkages between trust in government and citizen participation in Latin America and the Caribbean, using machine learning techniques and Latinobarómetro 2020 data. Proponents of the concept of stealth democracy argue that an inverse relationship exists between political trust and citizen participation, while deliberative democracy theorists claim the opposite. The paper estimates that trust in national governments or other governmental institutions plays neither a dominant nor consistent role in driving political participation. Instead, interest in politics, personal circumstances such as experience of crime and discrimination, and socioeconomic aspects appear to drive citizen participation much more strongly in the Latin America and the Caribbean region. This is true across models imposing simple linear trends (logit and lasso) and others allowing for nonlinear and complex relations (decision trees). The results vary across the type of participation—signing a petition, participation in demonstrations, or involvement in a community issue—which the paper attributes to increasing net costs associated with participation. |
Date: | 2023–03–01 |
URL: | https://d.repec.org/n?u=RePEc:wbk:wbrwps:10335 |
By: | Hsin-Min Lu; Yu-Tai Chien; Huan-Hsun Yen; Yen-Hsiu Chen |
Abstract: | Extracting specific items from 10-K reports remains challenging due to variations in document formats and item presentation. Traditional rule-based item segmentation approaches often yield suboptimal results. This study introduces two advanced item segmentation methods leveraging language models: (1) GPT4ItemSeg, using a novel line-ID-based prompting mechanism to utilize GPT4 for item segmentation, and (2) BERT4ItemSeg, combining BERT embeddings with a Bi-LSTM model in a hierarchical structure to overcome context window constraints. Trained and evaluated on 3,737 annotated 10-K reports, BERT4ItemSeg achieved a macro-F1 of 0.9825, surpassing GPT4ItemSeg (0.9567), conditional random field (0.9818), and rule-based methods (0.9048) for core items (1, 1A, 3, and 7). These approaches enhance item segmentation performance, improving text analytics in accounting and finance. BERT4ItemSeg offers satisfactory item segmentation performance, while GPT4ItemSeg can easily adapt to regulatory changes. Together, they offer practical benefits for researchers and practitioners, enabling reliable empirical studies and automated 10-K item segmentation functionality. |
Date: | 2025–02 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2502.08875 |
By: | Ruoyu Guo; Haochen Qiu |
Abstract: | Making consistently profitable financial decisions in a continuously evolving and volatile stock market has always been a difficult task. Professionals from different disciplines have developed foundational theories to anticipate price movement and evaluate securities, such as the famed Capital Asset Pricing Model (CAPM). In recent years, the role of artificial intelligence (AI) in asset pricing has been growing. Although the black-box nature of deep learning models lacks interpretability, they have continued to solidify their position in the financial industry. We aim to further enhance AI's potential and utility by introducing a return-weighted loss function that drives top growth while providing the ML models with only a limited amount of information. Using only publicly accessible stock data (open/close/high/low, trading volume, sector information) and several technical indicators constructed from them, we propose an efficient daily trading system that detects top growth opportunities. Our best models achieve 61.73% annual return on daily rebalancing with an annualized Sharpe Ratio of 1.18 over 1340 testing days from 2019 to 2024, and 37.61% annual return with an annualized Sharpe Ratio of 0.97 over 1360 testing days from 2005 to 2010. The main drivers for success, especially independent of any domain knowledge, are the novel return-weighted loss function, the integration of categorical and continuous data, and the ML model architecture. We also demonstrate the superiority of our novel loss function over traditional loss functions via several performance metrics and statistical evidence. |
Date: | 2025–02 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2502.17493 |
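The abstract above does not spell out the exact form of the return-weighted loss. One plausible minimal version scales each sample's squared error by the magnitude of its realized return, so that high-movement (top-growth) samples dominate training. Everything below is an assumption for illustration, not the authors' implementation:

```python
import numpy as np

def return_weighted_loss(y_true, y_pred, returns):
    """Illustrative return-weighted loss: each sample's squared error
    is weighted by the (normalized) magnitude of its realized return,
    steering the model toward large-movement opportunities."""
    w = np.abs(returns)
    w = w / w.sum()                          # weights sum to one
    return float(np.sum(w * (y_true - y_pred) ** 2))

y_true = np.array([0.05, -0.01, 0.10])       # realized returns
y_pred = np.array([0.04, 0.00, 0.02])        # model predictions
rets   = np.array([0.05, -0.01, 0.10])
# the third sample (largest |return|) contributes most of the loss
loss = return_weighted_loss(y_true, y_pred, rets)
```

Compared with a plain mean-squared error, this weighting makes errors on flat-return days nearly free and errors on big-move days expensive, which matches the paper's stated goal of detecting top growth opportunities.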
By: | Hasan, Sacha; Yuan, Yingfang |
Abstract: | Despite the accelerated digitalisation of social housing services, there has been a lack of focused attention on the harms that are likely to arise through the systemic inequalities encountered by minoritised ethnic (ME) communities in the UK. Within this context, we employ an intersectional framework to underline the centrality of age to ME vulnerabilities, including lack of digital literacy and proficiency in English, in the access, use and outcomes of digitalised social housing services. We draw our findings from an interdisciplinary sentiment analysis of 100 interviews with ME individuals in Glasgow, Bradford, Manchester and Tower Hamlets, extracting vulnerabilities and assessing their intensities across different ME age groups, together with a qualitative analysis of a subsample of 21 interviews. This illustrates the similarities and differences between machine learning (ML) and inductive coding in the sentiment analysis of these vulnerabilities, offering an example for future ML-supported qualitative data analysis approaches in housing studies. |
Date: | 2023–06–02 |
URL: | https://d.repec.org/n?u=RePEc:osf:socarx:jtc8k_v1 |
By: | Zhu, Jin; Wan, Runzhe; Qi, Zhengling; Luo, Shikai; Shi, Chengchun |
Abstract: | This paper endeavors to augment the robustness of offline reinforcement learning (RL) in scenarios laden with heavy-tailed rewards, a prevalent circumstance in real-world applications. We propose two algorithmic frameworks, ROAM and ROOM, for robust off-policy evaluation and offline policy optimization (OPO), respectively. Central to our frameworks is the strategic incorporation of the median-of-means method with offline RL, enabling straightforward uncertainty estimation for the value function estimator. This not only adheres to the principle of pessimism in OPO but also adeptly manages heavy-tailed rewards. Theoretical results and extensive experiments demonstrate that our two frameworks outperform existing methods when the logged dataset exhibits heavy-tailed reward distributions. The implementation of the proposal is available at https://github.com/Mamba413/ROOM. |
Keywords: | Rights Retention |
JEL: | C1 |
Date: | 2024–05–02 |
URL: | https://d.repec.org/n?u=RePEc:ehl:lserod:122740 |
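The median-of-means device at the core of the abstract above is easy to illustrate: split the sample into blocks, average within each block, and take the median of the block means, which bounds the influence of heavy-tailed outliers on the estimate. A minimal sketch (the helper name and toy data are illustrative, not from the ROOM codebase):

```python
import numpy as np

def median_of_means(x, n_blocks=5, rng=None):
    """Median-of-means estimator: shuffle the sample, split it into
    blocks, average each block, and return the median of the block
    means. Far more robust to heavy tails than the plain mean."""
    rng = np.random.default_rng(rng)
    x = rng.permutation(np.asarray(x, dtype=float))
    blocks = np.array_split(x, n_blocks)
    return float(np.median([b.mean() for b in blocks]))

# Heavy-tailed rewards: one extreme value drags the plain mean to ~50,
# while the median-of-means stays near the bulk of the data (~1)
rewards = np.array([1.0, 1.2, 0.9, 1.1, 1.0, 0.8, 1.1, 0.9, 1.0, 500.0])
mom = median_of_means(rewards, n_blocks=5, rng=0)
```

The outlier can corrupt at most one block mean, and the median across blocks discards it; the spread of the block means also gives the straightforward uncertainty estimate the abstract refers to.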
By: | Michael Pfarrhofer; Anna Stelzer |
Abstract: | We present an econometric framework that adapts tools for scenario analysis, such as variants of conditional forecasts and impulse response functions, for use with dynamic nonparametric multivariate models. We demonstrate the utility of our approach with simulated data and three real-world applications: (1) scenario-based conditional forecasts aligned with Federal Reserve stress test assumptions, measuring (2) macroeconomic risk under varying financial conditions, and (3) asymmetric effects of US-based financial shocks and their international spillovers. Our results indicate the importance of nonlinearities and asymmetries in dynamic relationships between macroeconomic and financial variables. |
Date: | 2025–02 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2502.08440 |
By: | Moustapha Pemy; Na Zhang |
Abstract: | This paper studies the ubiquitous problem of liquidating large quantities of highly correlated stocks, a task frequently encountered by institutional investors and proprietary trading firms. Traditional methods in this setting suffer from the curse of dimensionality, making them impractical for high-dimensional problems. In this work, we propose a novel method based on stochastic optimal control to optimally tackle this complex multidimensional problem. The proposed method minimizes the overall execution shortfall of highly correlated stocks using a reinforcement learning approach. We rigorously establish the convergence of our optimal trading strategy and present an implementation of our algorithm using intra-day market data. |
Date: | 2025–02 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2502.07868 |
By: | Blanchard, Paul Baptiste; Ishizawa Escudero, Oscar Anil; Humbert, Thibaut; Van Der Borght, Rafael |
Abstract: | Weather-related shocks and climate variability contribute to hampering progress toward poverty reduction in Sub-Saharan Africa. Droughts have a direct impact on weather-dependent livelihood means and the potential to affect key dimensions of households’ welfare, including food consumption. Yet, the ability to forecast food insecurity for intervention planning remains limited and current approaches mainly rely on qualitative methods. This paper incorporates microeconomic estimates of the effect of the rainy season quality on food consumption into a catastrophe risk modeling approach to develop a novel framework for early forecasting of food insecurity at sub-national levels. The model relies on three usual components of catastrophe risk models that are adapted for estimation of the impact of rainy season quality on food insecurity: natural hazards, households’ vulnerability, and exposure. The paper applies this framework in the context of rural Mauritania and optimizes the model calibration with a machine learning procedure. The model can produce fairly accurate lean season food insecurity predictions very early in the agricultural season (October-November), that is, six to eight months ahead of the lean season. Comparisons of model predictions with survey-based estimates yield a mean absolute error of 1.2 percentage points at the national level and a high degree of correlation at the regional level (0.84). |
Date: | 2023–05–30 |
URL: | https://d.repec.org/n?u=RePEc:wbk:wbrwps:10457 |