|
on Big Data |
By: | Fabio Gatti; Joel Huesler |
Abstract: | The correspondence of historical personalities serves as a rich source of psychological, social, and economic information. Letters served as a means of communication within family circles and also as a primary method for exchanging information with colleagues, subordinates, and employers. A quantitative analysis of such material enables scholars to reconstruct both the internal psychology and the relational networks of historical figures, ultimately providing deeper insights into the socio-economic systems in which they were embedded. In this study, we analyze the outgoing correspondence of Michelangelo Buonarroti, a prominent Renaissance artist, using a collection of 523 letters as the basis for a structured text analysis. Our methodological approach compares three distinct Natural Language Processing methods: an Augmented Dictionary Approach, which relies on static lexicon analysis and Latent Dirichlet Allocation (LDA) for topic modeling; a Supervised Machine Learning Approach that utilizes BERT-generated letter embeddings combined with a Random Forest classifier trained by the authors; and an Unsupervised Machine Learning Method. The comparison of these three methods, benchmarked against biographic knowledge, allows us to construct a robust understanding of Michelangelo’s emotional association with monetary, thematic, and social factors. Furthermore, it highlights how the Supervised Machine Learning method, by incorporating the authors’ domain knowledge and understanding of documents and background, can provide, in the context of Renaissance multi-themed letters, a more nuanced interpretation of contextual meanings, enabling the detection of subtle (positive or negative) sentimental variations due to a variety of factors that other methods can overlook. |
Date: | 2025 |
URL: | https://d.repec.org/n?u=RePEc:baf:cbafwp:cbafwp25251 |
By: | Yonggeun Jung |
Abstract: | This paper proposes a scalable framework to estimate monthly GDP using machine learning methods. We apply Multi-Layer Perceptron (MLP), Long Short-Term Memory networks (LSTM), Extreme Gradient Boosting (XGBoost), and Elastic Net regression to map monthly indicators to quarterly GDP growth, and reconcile the outputs with actual aggregates. Using data from China, Germany, the UK, and the US, our method delivers robust performance across varied data environments. Benchmark comparisons with prior US studies and UK official statistics validate its accuracy. The approach offers a flexible and data-driven tool for high-frequency macroeconomic monitoring and policy analysis. |
Date: | 2025–06 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2506.14078 |
By: | Raffaella Barone |
Abstract: | This paper examines the relationship between non-residential property prices and various social, economic, and environmental indicators within the provinces where these properties are located. We focus on indicators from the Eni Enrico Mattei Foundation and SDSN Italia that track the 17 sustainable development goals, as well as additional factors like crime rates, per capita GDP, and sales frequency. Using a machine learning algorithm, we predicted property sale prices and applied SHapley Additive exPlanations to assess the importance of each variable. Our findings highlight the strong influence of categorical variables and SDG indicators on prices. Finally, we used causal inference to explore how policy interventions might affect property prices. |
Keywords: | Machine Learning, Real estate market, Financial Stability, Sustainability, Crimes |
JEL: | B4 C1 G01 R33 |
Date: | 2025 |
URL: | https://d.repec.org/n?u=RePEc:baf:cbafwp:cbafwp25238 |
By: | Charles Shaw |
Abstract: | The double/debiased machine learning (DML) framework has become a cornerstone of modern causal inference, allowing researchers to utilise flexible machine learning models for the estimation of nuisance functions without introducing first-order bias into the final parameter estimate. However, the choice of machine learning model for the nuisance functions is often treated as a minor implementation detail. In this paper, we argue that this choice can have a profound impact on the substantive conclusions of the analysis. We demonstrate this by presenting and comparing two distinct Distributional Instrumental Variable Local Average Treatment Effect (D-IV-LATE) estimators. The first estimator leverages standard machine learning models like Random Forests for nuisance function estimation, while the second is a novel estimator employing Kolmogorov-Arnold Networks (KANs). We establish the asymptotic properties of these estimators and evaluate their performance through Monte Carlo simulations. An empirical application analysing the distributional effects of 401(k) participation on net financial assets reveals that the choice of machine learning model for nuisance functions can significantly alter substantive conclusions, with the KAN-based estimator suggesting more complex treatment effect heterogeneity. These findings underscore a critical "caveat emptor". The selection of nuisance function estimators is not a mere implementation detail. Instead, it is a pivotal choice that can profoundly impact research outcomes in causal inference. |
Date: | 2025–06 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2506.12765 |
By: | Afroze, Farhana |
Abstract: | Introduction: Bangladesh symbolizes how systematic gender bias impairs women's health. Economic instability, violence, mental health issues, and environmental vulnerability are all interconnected issues that exacerbate the socio-economic challenges women face in their day-to-day lives. Eventually, this makes women more vulnerable to developing non-communicable diseases (NCDs). Objective: This study aims to establish causal links between poverty, gender disparity, and NCD risks in women. It is one of the first studies to apply machine learning techniques to explore the relationship between gender disparity and NCD mortality among Bangladeshi women. The paper evaluates the multidimensional aspect of gender norms that strain women's health. Methodology: Data analysis was done using a synthetic dataset, generated with a GAN, that mimics real-world datasets. OLS, random forest, lasso regression, and XGBoost were employed to address the research objectives. Results: The primary results identify income level as the main predictor of NCD mortality. Unemployment rate, unpaid domestic labor, and high stress levels are the secondary predictors. Conclusion: Addressing economic, socioeconomic, and cultural oppression is crucial for improving the country's health. The government and policymakers need to introduce gendered health policies to improve health equity in Bangladesh. |
Keywords: | Gender Disparity, Non-Communicable Diseases (NCDs), Women's Health, Economic Instability, Poverty, Gender Bias, Machine Learning, Synthetic Data, OLS Regression, Random Forest, Lasso Regression, XGBoost, Unemployment, Domestic Labor, Mental Health, Stress, Health Equity, Policy Intervention, Bangladesh. |
JEL: | I1 I12 I14 I18 I32 J1 J11 J12 J16 J18 |
Date: | 2024–12–10 |
URL: | https://d.repec.org/n?u=RePEc:pra:mprapa:123520 |
By: | Leland D. Crane; Akhil Karra; Paul E. Soto |
Abstract: | We evaluate the ability of large language models (LLMs) to estimate historical macroeconomic variables and data release dates. We find that LLMs have precise knowledge of some recent statistics, but performance degrades as we go farther back in history. We highlight two particularly important kinds of recall errors: mixing together first print data with subsequent revisions (i.e., smoothing across vintages) and mixing data for past and future reference periods (i.e., smoothing within vintages). We also find that LLMs can often recall individual data release dates accurately, but aggregating across series shows that on any given day the LLM is likely to believe it has data in hand which has not been released. Our results indicate that while LLMs have impressively accurate recall, their errors point to some limitations when used for historical analysis or to mimic real time forecasters. |
Keywords: | Artificial intelligence; Forecasting; Large language models; Real-time data |
JEL: | C53 C80 E37 |
Date: | 2025–06–25 |
URL: | https://d.repec.org/n?u=RePEc:fip:fedgfe:2025-44 |
By: | Dominic Zaun Eu Jones |
Abstract: | I develop Ornithologist, a weakly-supervised textual classification system, and measure the hawkishness and dovishness of central bank text. Ornithologist uses "taxonomy-guided reasoning", guiding a large language model with human-authored decision trees. This increases the transparency and explainability of the system and makes it accessible to non-experts. It also reduces hallucination risk. Since it requires less supervision than traditional classification systems, it can more easily be applied to other problems or sources of text (e.g. news) without much modification. Ornithologist measurements of hawkishness and dovishness of RBA communication carry information about the future of the cash rate path and of market expectations. |
Date: | 2025–05 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2505.09083 |
By: | Nicolas Houlié |
Abstract: | I show that house prices can be modeled using machine learning (kNN and tree-bagging) and a small dataset composed of macro-economic factors (MEF), including an inflation metric (CPI), US treasury rates (10-yr), Gross Domestic Product (GDP), and portfolio size of central banks (ECB, FED). This set of parameters covers all the parties involved in a transaction (buyer, seller, and financing facility) while ignoring the intrinsic properties of each asset and encompassing local (inflation) and liquidity issues that may impede each transaction composing a market. The model here takes the point of view of a real estate trader who is interested in both the financing and the price of the transaction. Machine Learning allows for the discrimination of two periods within the dataset. Unconventional policies of central banks may have allowed some institutional investors to arbitrage between real estate returns and other bond markets (sovereign and corporate). Finally, to assess the models' relative performances, I performed various sensitivity tests, which tend to constrain the possibilities of each approach for each need. I also show that some models can predict the evolution of prices over the next 4 quarters with uncertainties that outperform existing index uncertainties. |
Date: | 2025–04 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2505.09620 |
By: | Samuel Kaplan (UNC/UDESA); Efstathios Polyzos (Zayed University/CAMA Australia); David Tercero-Lucas (ICADE/ICAI/Universidad Pontificia Comillas) |
Abstract: | The growing influence of cryptocurrencies in global financial markets has raised questions about the impact of central bank communications on their price dynamics. This paper investigates how central bank communication affects the behaviour of cryptocurrency markets. Using a dataset of over 6,000 central bank speeches and a broad panel of crypto assets, we quantify sentiment, uncertainty, and fear tone through natural language processing and assess their impact using local projection methods. Our results show that positive tone initially depresses returns while raising volatility, whereas uncertainty and fear produce mixed return responses and amplify price fluctuations in the short run. Heterogeneity across asset types reveals stronger responses among emerging, high-performing, and non-stablecoin cryptocurrencies. The findings highlight the informational role of central bank narratives in shaping outcomes in speculative and decentralised markets, with important implications for communication policy and financial stability monitoring. |
Keywords: | Cryptocurrency, Central Bank Communication, Text Analysis, Sentiment Analysis, Monetary Policy |
JEL: | D53 E52 E58 G15 O33 |
Date: | 2025–07 |
URL: | https://d.repec.org/n?u=RePEc:aoz:wpaper:365 |
By: | Xin Sheng (Lord Ashcroft International Business School, Anglia Ruskin University, Chelmsford, United Kingdom); Oguzhan Cepni (Ostim Technical University, Ankara, Turkiye; University of Edinburgh Business School, Centre for Business, Climate Change, and Sustainability; Department of Economics, Copenhagen Business School, Denmark); Rangan Gupta (Department of Economics, University of Pretoria, Private Bag X20, Hatfield 0028, South Africa); Minko Markovski (Department of Economics, University of Reading, Reading, United Kingdom) |
Abstract: | We forecast the quarterly growth rate of real gross fixed capital formation of the United States using the information content of a monthly metric of extreme weather conditions, while controlling for a set of principal components derived from a large data set of economic and financial indicators. In this regard, we utilize a Mixed Frequency Machine Learning framework over the period of 1974:Q1 to 2022:Q1. Our results show that incorporating monthly data on severe climatic conditions, especially the information contained in relatively higher (above the mean) extreme weather values, significantly outperforms not only the benchmark autoregressive model, but also the econometric framework that includes the macro-finance factors when forecasting the growth rate of quarterly real gross fixed capital formation. |
Keywords: | Gross fixed capital formation, Extreme weather conditions, Mixed frequency, Machine learning, Forecasting |
JEL: | C22 C53 E22 Q54 |
Date: | 2025–05 |
URL: | https://d.repec.org/n?u=RePEc:pre:wpaper:202520 |
By: | Dhanashree Somani (Department of Statistics, University of Florida, 230 Newell Drive, Gainesville, FL, 32601, USA); Rangan Gupta (Department of Economics, University of Pretoria, Private Bag X20, Hatfield 0028, South Africa); Sayar Karmakar (Department of Statistics, University of Florida, 230 Newell Drive, Gainesville, FL, 32601, USA); Vasilios Plakandaras (Department of Economics, Democritus University of Thrace, Komotini, Greece) |
Abstract: | The objective of this paper is to forecast volatilities of the stock returns of China, France, Germany, Italy, Spain, the United Kingdom (UK), and the United States (US) over the daily period of January 2010 to February 2025 by utilizing the information content of newspaper article-based indexes of supply bottlenecks. We measure volatility by employing the interquantile range, estimated via an asymmetric slope autoregressive quantile regression fitted on stock returns to derive the conditional quantiles. In the process, we are also able to obtain estimates of skewness, kurtosis, lower- and upper-tail risks, and incorporate them into our linear predictive model, alongside leverage. Based on the shrinkage estimation using the Lasso estimator to control for overparameterization, we find that the model with moments outperforms the benchmark model that includes both own- and cross-country volatilities, but the performance of the former is improved further when we incorporate the role of the metrics of supply constraints for all 7 countries simultaneously. These findings carry significant implications for investors. |
Keywords: | Supply Bottlenecks, Stock Market Volatility, Asymmetric Autoregressive Quantile Regression, Lasso Estimator, Forecasting |
JEL: | C22 C53 E23 G15 |
Date: | 2025–06 |
URL: | https://d.repec.org/n?u=RePEc:pre:wpaper:202521 |
By: | Marco Zanotti |
Abstract: | Given the continuous increase in dataset sizes and the complexity of forecasting models, the tradeoff between forecast accuracy and computational cost is emerging as an extremely relevant topic, especially in the context of ensemble learning for time series forecasting. To assess it, we evaluated ten base models and eight ensemble configurations across two large-scale retail datasets (M5 and VN1), considering both point and probabilistic accuracy under varying retraining frequencies. We showed that ensembles consistently improve forecasting performance, particularly in probabilistic settings. However, these gains come at a substantial computational cost, especially for larger, accuracy-driven ensembles. We found that reducing retraining frequency significantly lowers costs, with minimal impact on accuracy, particularly for point forecasts. Moreover, efficiency-driven ensembles offer a strong balance, achieving competitive accuracy with considerably lower costs compared to accuracy-optimized combinations. Most importantly, small ensembles of two or three models are often sufficient to achieve near-optimal results. These findings provide practical guidelines for deploying scalable and cost-efficient forecasting systems, supporting the broader goals of sustainable AI in forecasting. Overall, this work shows that careful ensemble design and retraining strategy selection can yield accurate, robust, and cost-effective forecasts suitable for real-world applications. |
Keywords: | Time series, Demand forecasting, Forecasting competitions, Cross-learning, Global models, Forecast combinations, Ensemble learning, Machine learning, Deep learning, Conformal predictions, Green AI. |
JEL: | C53 C52 C55 |
Date: | 2025–06 |
URL: | https://d.repec.org/n?u=RePEc:mib:wpaper:554 |
By: | Hanming Fang; Ming Li; Guangli Lu |
Abstract: | We decode China’s industrial policies from 2000 to 2022 by employing large language models (LLMs) to extract and analyze rich information from a comprehensive dataset of 3 million documents issued by central, provincial, and municipal governments. Through careful prompt engineering, multistage extraction and refinement, and rigorous verification, we use LLMs to classify the industrial policy documents and extract structured information on policy objectives, targeted industries, policy tones (supportive or regulatory/suppressive), policy tools, implementation mechanisms, and intergovernmental relationships, etc. Combining these newly constructed industrial policy data with micro-level firm data, we document four sets of facts about China's industrial policy that explore the following questions: What are the economic and political foundations of the targeted industries? What policy tools are deployed? How do policy tools vary across different levels of government and regions, as well as over the phases of an industry's development? What are the impacts of these policies on firm behavior, including entry, production, and productivity growth? We also explore the political economy of industrial policy, focusing on top-down transmission mechanisms, policy persistence, and policy diffusion across regions. Finally, we document spatial inefficiencies and industry-wide overcapacity as potential downsides of industrial policies. |
JEL: | C55 L52 O25 |
Date: | 2025–05 |
URL: | https://d.repec.org/n?u=RePEc:nbr:nberwo:33814 |
By: | Nsababera, Olive; Dickens, Richard; Disney, Richard |
Abstract: | With the rise of forced displacement, attention has turned to the economic impact of refugees. However, few studies investigate long-term impacts. We use data for Tanzania for the period 1985–2015 to examine the effect of camps on urbanisation and local development, exploiting a unique satellite-derived dataset of high spatial resolution and temporal frequency. We show a modest but significant effect of refugee camps on built-up area up to a 100 km distance. We then match camp locations to regional gross domestic product, local consumption spending and employment patterns. Output in areas with camps grew at a faster rate during camp operation, but closure of camps was associated with change in economic activity. Activity induced by camps is largely in non-tradeable goods and services rather than inducing longer run structural transformation. |
Keywords: | refugee camp; urbanisation; satellite imagery; consumption; spatial |
JEL: | J61 O15 R12 R14 |
Date: | 2023–12–20 |
URL: | https://d.repec.org/n?u=RePEc:ehl:lserod:121186 |
By: | William M. Cassidy; Elisabeth Kempf |
Abstract: | We construct a novel measure of partisan corporate speech using natural language processing techniques and use it to establish three stylized facts. First, the volume of partisan corporate speech has risen sharply between 2012 and 2022. Second, this increase has been disproportionately driven by companies adopting more Democratic-leaning language, a trend that is widespread across industries, geographies, and CEO political affiliations. Third, partisan corporate statements are followed by negative abnormal stock returns, with significant heterogeneity by shareholders’ degree of alignment with the statement. Finally, we propose a theoretical framework and provide suggestive empirical evidence that these trends are at least in part driven by a shift in investors’ nonpecuniary preferences with respect to partisan corporate speech. |
JEL: | G0 G1 G23 G3 G4 |
Date: | 2025–05 |
URL: | https://d.repec.org/n?u=RePEc:nbr:nberwo:33810 |
By: | Martin Feldkircher; Petr Korab; Viktoriya Teliha |
Abstract: | This paper analyzes the evolution of central bank topics using a corpus of over 20,000 speeches spanning nearly three decades and a range of topic models. We identify thirteen themes, including monetary policy, financial stability, digital payments, and climate-related finance. Examining their development over time, we classify these themes as "evergreens", "waning threads", or emergent "rising stars", and show that early adoption and topic leadership are nearly equally shared between emerging and advanced economies' central banks. In the aftermath of the Global Financial Crisis, topic focus converged worldwide, with a renewed emphasis on financial stability. Finally, static covariate regressions link topic prevalence to inflation regimes, central bank independence, and speech format, highlighting the impact of macroeconomic and institutional factors on communication priorities. |
Keywords: | monetary policy, financial stability, digital payments, climate-related finance |
JEL: | C55 C88 E52 E58 D83 |
Date: | 2025–06 |
URL: | https://d.repec.org/n?u=RePEc:een:camaaa:2025-35 |
By: | Jorge Onrubia; Fernando Pinto; María del Carmen Rodado Ruíz |
Abstract: | This paper examines the predictive relationship between online search behavior and international housing demand, focusing on UK citizens purchasing property in Spain from 2014 to 2024. Using Google Trends data for the search term "Spain villas" alongside official transaction records, we estimate autoregressive (AR), augmented (ARX), and interaction models to assess whether digital intent anticipates real estate purchases. Results show that search intensity significantly enhances model performance before the 2016 Brexit referendum |
Date: | 2025–07 |
URL: | https://d.repec.org/n?u=RePEc:fda:fdaddt:2025-08 |
By: | Paula Bejarano Carbo; Rory MacQueen; Efthymios Xylangouras |
Abstract: | Our predecessors at NIESR helped pioneer the estimation of monthly Gross Domestic Product (GDP) in the United Kingdom (Mitchell et al., 2005). This approach, which led to the publication of monthly GDP by the Office for National Statistics (ONS) from 2018, aggregates sectoral indices (e.g. agriculture, construction) to produce an overall estimate of monthly economic growth. Since 2018, NIESR has produced its monthly GDP 'tracker' on the ONS estimate release date, commenting on the latest data point and producing a 'bottom-up' forecast (i.e. constructed by aggregating sectoral forecasts) of economic performance up to the end of the next quarter (Kara et al., 2018). The NIESR GDP tracker forecast of 10 sub-sectors and aggregate GDP combines a 'fixed parameter' approach and forecaster judgement to produce, we find, accurate estimates one month in advance of the official first estimate. |
Date: | 2025–06 |
URL: | https://d.repec.org/n?u=RePEc:nsr:niesrp:46 |
By: | Stefaniia Parubets; Hisahiro Naito |
Abstract: | This study evaluates the effectiveness of satellite-derived tropospheric nitrogen dioxide (NO2) concentrations as a proxy for economic activity in Japan. While nighttime light (NTL) data has been widely used to approximate economic output, recent research has highlighted its key limitations. In particular, the relationship between NTL and economic outcomes weakens in sub-sample analyses with shorter time spans or restricted geographic coverage. NTL data also faces several key limitations: saturation in dense urban areas reduces measurement accuracy, capturing nighttime emissions fails to account for essential daytime economic activity, inconsistent sensors across different satellites introduce measurement variability, and the technology's sensitivity diminishes when differentiating economic development beyond certain brightness thresholds. Our results show that NO2's effectiveness as an economic proxy is highly dependent on spatial resolution. Using 0.25 degree resolution NO2 data, we find statistically significant relationships with prefecture-level GDP across multiple sectors. Mining shows the strongest elasticity (3.02%), followed by electricity, gas, and water (1.51%), and manufacturing (0.48%). Agriculture, forestry, and fisheries exhibit negative associations (-0.11%), consistent with vegetation serving as NO2 sinks. However, when using higher resolution 0.1 degree NO2 data, these relationships largely disappear, with most coefficients becoming statistically insignificant and sometimes counterintuitive. These findings highlight the importance of matching satellite data resolution to the geographic scale of economic analysis, with coarser resolution being optimal for prefecture-level analysis in the Japanese context. This research demonstrates NO2's potential as a more reliable alternative to NTL for economic monitoring when appropriately calibrated.
This study examines the effect of exports on subnational income and regional inequality between urban (trade hub) and rural (non–trade hub) areas, using nighttime luminosity as a proxy for economic activity. We construct a country-period panel dataset covering 104 countries, based on five-year average data from 1997 to 2020. Trade hub areas are defined as the union of areas within a 30 km or 50 km radius of each of the three largest ports and three international airports in a country, while all remaining areas are classified as non–trade hub areas. To address endogeneity, we employ a two-stage least squares (2SLS) approach, using predicted trade as an instrumental variable. Predicted trade is derived from a dynamic gravity equation in which time dummies are interacted with sea and air transport distances. This instrument captures variation in transportation costs driven by technological advances that have shifted trade from sea to air, thereby influencing trade volumes. Our results show that a 1% increase in exports raises nighttime luminosity by 0.3% in trade hub areas and by 0.06% in non–trade hub areas. Export growth also leads to population increases in trade hub areas, but not in non–trade hub areas. Furthermore, we find that a 1% increase in exports raises nighttime luminosity per capita by 0.18% in trade hub areas and by 0.06% in non–trade hub areas. These findings suggest that while exports stimulate economic activity in trade hubs, population inflows partially offset per capita gains. Nonetheless, exports significantly exacerbate regional inequality. |
Date: | 2025–05 |
URL: | https://d.repec.org/n?u=RePEc:tsu:tewpjp:2025-002 |