nep-cmp New Economics Papers
on Computational Economics
Issue of 2025–12–22
nineteen papers chosen by
Stan Miles, Thompson Rivers University


  1. Integration of LSTM Networks in Random Forest Algorithms for Stock Market Trading Predictions By Juan C. King; Jose M. Amigo
  2. New Approximation Results and Optimal Estimation for Fully Connected Deep Neural Networks By Zhaoji Tang
  3. Learning How to Vote with Principles: Axiomatic Insights Into the Collective Decisions of Neural Networks By Levin Hornischer; Zoi Terzopoulou
  4. Responsible LLM Deployment for High-Stake Decisions by Decentralized Technologies and Human-AI Interactions By Swati Sachan; Theo Miller; Mai Phuong Nguyen
  5. Imputing Measures of Diet Quality Using Circana Scanner Data and Machine Learning By Stevens, Alexander; Okrent, Abigail M.; Mancino, Lisa
  6. Automated data extraction from unstructured text using LLMs: A scalable workflow for Stata users By Loreta Isaraj
  7. The Elaboration of the Patent Processing Instrument Based on Machine Learning Technology By Sheresheva, M.Y.
  8. Measuring Corruption from Text Data By Arieda Muço
  9. Exploring USDA-FSA Farm Lending Patterns: Machine Learning-Based Models for Understanding the Impact of Borrower Attributes on Loan Purposes By Zheng, Maoyong; Escalante, Cesar L.
  10. Reinforcement Learning in Financial Decision Making: A Systematic Review of Performance, Challenges, and Implementation Strategies By Mohammad Rezoanul Hoque; Md Meftahul Ferdaus; M. Kabir Hassan
  11. Projecting Entropy of University Culture Dissemination Using the Axelrod Model By Arief Rahman
  12. This Candidate is [MASK]. Prompt-based Sentiment Extraction and Reference Letters By Slonimczyk, Fabian
  13. Forecasting Disaggregated Food Inflation Baskets in Colombia with an XGBoost Model By César Anzola Bravo; Paola Poveda
  14. Differential ML with a Difference By Paul Glasserman; Siddharth Hemant Karmarkar
  15. Partial multivariate transformer as a tool for cryptocurrencies time series prediction By Andrzej Tokajuk; Jarosław A. Chudziak
  16. Volatility time series modeling by single-qubit quantum circuit learning By Tetsuya Takaishi
  17. The Economics of Professional Decision-Making: Can Artificial Intelligence Reduce Decision Uncertainty? By W Bentley MacLeod
  18. Artificial Intelligence for Detecting Price Surges Based on Network Features of Crypto Asset Transactions By Yuichi IKEDA; Hideaki AOYAMA; Tetsuo HATSUDA; Tomoyuki SHIRAI; Taro HASUI; Yoshimasa HIDAKA; Krongtum SANKAEWTONG; Hiroshi IYETOMI; Yuta YARAI; Abhijit CHAKRABORTY; Yasushi NAKAYAMA; Akihiro FUJIHARA; Pierluigi CESANA; Wataru SOUMA
  19. Text mining and hierarchical clustering in Stata: An applied approach for real-time policy monitoring, forecasting, and literature mapping. By Carlo Drago

  1. By: Juan C. King; Jose M. Amigo
    Abstract: The aim of this paper is the analysis and selection of stock trading systems that combine different models with data of different nature, such as financial and microeconomic information. Specifically, based on previous work by the authors and applying advanced techniques of Machine Learning and Deep Learning, our objective is to formulate trading algorithms for the stock market with empirically tested statistical advantages, thus improving results published in the literature. Our approach integrates Long Short-Term Memory (LSTM) networks with algorithms based on decision trees, such as Random Forest and Gradient Boosting. While the former analyze price patterns of financial assets, the latter are fed with economic data of companies. Numerical simulations of algorithmic trading with data from international companies and 10-weekday predictions confirm that an approach based on both fundamental and technical variables can outperform the usual approaches, which do not combine those two types of variables. In doing so, Random Forest turned out to be the best performer among the decision trees. We also discuss how the prediction performance of such a hybrid approach can be boosted by selecting the technical variables.
    Date: 2025–11
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2512.02036
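The hybrid architecture in this abstract keeps the two model families separate and merges only their outputs. A minimal sketch of that combination step follows; the scores, weight, and thresholds are illustrative stand-ins, since the paper's actual components are trained LSTM networks (on price patterns) and Random Forest / Gradient Boosting classifiers (on company fundamentals):

```python
# Sketch of the hybrid signal combination: an LSTM produces a technical
# (price-pattern) score and a tree ensemble produces a fundamental score;
# a simple weighted blend maps them to a trade signal. The numeric
# thresholds below are illustrative, not taken from the paper.

def combine_signals(lstm_score: float, forest_score: float,
                    w_technical: float = 0.5) -> str:
    """Blend technical and fundamental scores into a long/flat/short signal."""
    blended = w_technical * lstm_score + (1.0 - w_technical) * forest_score
    if blended > 0.55:
        return "long"
    if blended < 0.45:
        return "short"
    return "flat"

# Example: technical model bullish, fundamental model neutral.
signal = combine_signals(lstm_score=0.8, forest_score=0.5)
```

In practice each score would be a trained model's predicted probability of an up-move over the 10-weekday horizon, and the blend weight could itself be tuned on validation data.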
  2. By: Zhaoji Tang
    Abstract: Farrell, Liang, and Misra (2021) establish non-asymptotic high-probability bounds for general deep feedforward neural network estimators (with rectified linear unit activation function), with their Theorem 1 achieving a suboptimal convergence rate for fully connected feedforward networks. The authors suggest that improved approximation of fully connected networks could yield sharper versions of that theorem without altering the theoretical framework. By deriving approximation bounds specifically for a narrower fully connected deep neural network, this note demonstrates that the theorem can be improved to achieve an optimal rate (up to a logarithmic factor). Furthermore, this note briefly shows that deep neural network estimators can mitigate the curse of dimensionality for functions with compositional structure and functions defined on manifolds.
    Date: 2025–12
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2512.09853
  3. By: Levin Hornischer (LMU - Ludwig Maximilian University [Munich] = Ludwig Maximilians Universität München); Zoi Terzopoulou (GATE Lyon Saint-Étienne - Groupe d'Analyse et de Théorie Economique Lyon - Saint-Etienne - UL2 - Université Lumière - Lyon 2 - UJM - Université Jean Monnet - Saint-Étienne - EM - EMLyon Business School - CNRS - Centre National de la Recherche Scientifique)
    Abstract: Can neural networks be applied in voting theory, while satisfying the need for transparency in collective decisions? We propose axiomatic deep voting: a framework to build and evaluate neural networks that aggregate preferences, using the well-established axiomatic method of voting theory. Our findings are: (1) Neural networks, despite being highly accurate, often fail to align with the core axioms of voting rules, revealing a disconnect between mimicking outcomes and reasoning. (2) Training with axiom-specific data does not enhance alignment with those axioms. (3) By solely optimizing axiom satisfaction, neural networks can synthesize new voting rules that often surpass and substantially differ from existing ones. This offers insights for both fields: for AI, important concepts like bias and value-alignment are studied in a mathematically rigorous way; for voting theory, new areas of the space of voting rules are explored.
    Keywords: Voting theory, Neural networks
    Date: 2025–08–05
    URL: https://d.repec.org/n?u=RePEc:hal:journl:hal-05395413
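The axiomatic evaluation loop described above, scoring an aggregation rule by whether it satisfies voting axioms, can be sketched with a single axiom check. Here plurality stands in for a trained network, and the check covers only the special case of unanimity in which all voters submit the same ranking; candidates and profile sizes are illustrative:

```python
# Sketch of one axiom check in an axiomatic-evaluation loop. The "rule"
# under test is plurality (a stand-in for a trained network); the axiom
# tested is a weak form of unanimity: if every voter submits the same
# ranking, the commonly top-ranked candidate must win.
from collections import Counter
from itertools import permutations

def plurality(profile):
    """Winner = candidate ranked first by the most voters (ties: min name)."""
    counts = Counter(ranking[0] for ranking in profile)
    top = max(counts.values())
    return min(c for c, v in counts.items() if v == top)

def satisfies_weak_unanimity(rule, candidates=("a", "b", "c"), n_voters=3):
    """Check all profiles where every voter submits the identical ranking."""
    for top in candidates:
        rest = [c for c in candidates if c != top]
        for tail in permutations(rest):
            profile = [(top,) + tail] * n_voters
            if rule(profile) != top:
                return False
    return True

ok = satisfies_weak_unanimity(plurality)
```

A full evaluation would iterate such checks over many sampled profiles and axioms, turning axiom-satisfaction rates into the training or evaluation signal the abstract describes.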
  4. By: Swati Sachan; Theo Miller; Mai Phuong Nguyen
    Abstract: High-stakes decision domains are increasingly exploring the potential of Large Language Models (LLMs) for complex decision-making tasks. However, LLM deployment in real-world settings presents challenges in data security, evaluation of its capabilities outside controlled environments, and accountability attribution in the event of adversarial decisions. This paper proposes a framework for responsible deployment of LLM-based decision-support systems through active human involvement. It integrates interactive collaboration between human experts and developers through multiple iterations at the pre-deployment stage to assess the uncertain samples and judge the stability of the explanation provided by post-hoc XAI techniques. Local LLM deployment within organizations and decentralized technologies, such as Blockchain and IPFS, are proposed to create immutable records of LLM activities for automated auditing to enhance security and trace back accountability. The framework was tested on BERT-large-uncased, Mistral, and LLaMA 2 and 3 models to assess their capability to support responsible financial decisions on business lending.
    Date: 2025–11
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2512.04108
  5. By: Stevens, Alexander; Okrent, Abigail M.; Mancino, Lisa
    Keywords: Food Consumption/Nutrition/Food Safety
    Date: 2024
    URL: https://d.repec.org/n?u=RePEc:ags:aaea24:343997
  6. By: Loreta Isaraj (IRCrES-CNR)
    Abstract: In several data-rich domains such as finance, medicine, law, and scientific publishing, most of the valuable information is embedded in unstructured textual formats, from clinical notes and legal briefs to financial statements and research papers. These sources are rarely available in structured formats suitable for immediate quantitative analysis. This presentation introduces a scalable and fully integrated workflow that employs large language models (LLMs), specifically ChatGPT 4.0 via API, in conjunction with Python and Stata to extract structured variables from unstructured documents and make them ready for further statistical processing in Stata. As a representative use case, I demonstrate the extraction of information from a SOAP clinical note, treated as a typical example of unstructured medical documentation. The process begins with a single PDF and extends to an automated pipeline capable of batch-processing multiple documents, highlighting the scalability of this approach. The workflow involves PDF parsing and text preprocessing using Python, followed by prompt engineering designed to optimize the performance of the LLM. In particular, the temperature parameter is tuned to a low value (for example, 0.0–0.3) to promote deterministic and concise extraction, minimizing variation across similar documents and ensuring consistency in output structure. Once the LLM returns structured data, typically in JSON or CSV format, it is seamlessly imported into Stata using custom .do scripts that handle parsing (insheet), transformation (split, reshape), and data cleaning. The final dataset is used for exploratory or inferential analysis, with visualization and summary statistics executed entirely within Stata.
The presentation also addresses critical considerations including the computational cost of using commercial LLM APIs (token-based billing), privacy and compliance risks when processing sensitive data (such as patient records), and the potential for bias or hallucination inherent to generative models. To assess the reliability of the extraction process, I report evaluation metrics such as cosine similarity (for text alignment and summarization accuracy) and F1-score (for evaluating named entity and numerical field extraction). By bridging the capabilities of LLMs with Stata’s powerful analysis tools, this workflow equips researchers and analysts with an accessible method to unlock structured insights from complex unstructured sources, extending the reach of empirical research into previously inaccessible text-heavy datasets.
    Date: 2025–10–01
    URL: https://d.repec.org/n?u=RePEc:boc:isug25:13
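One step of the workflow above, flattening the LLM's JSON output into a CSV that Stata can then read with import delimited, can be sketched with the standard library alone. The clinical field names below are hypothetical; the real fields depend on the prompt used:

```python
# Sketch of the post-LLM step: the model returns structured JSON, which is
# flattened to CSV so Stata can ingest it. Field names are hypothetical.
import csv, io, json

llm_output = '''[
  {"patient_id": "A01", "bp_systolic": 128, "diagnosis": "hypertension"},
  {"patient_id": "A02", "bp_systolic": 117, "diagnosis": "none"}
]'''

records = json.loads(llm_output)
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=records[0].keys())
writer.writeheader()
writer.writerows(records)
csv_text = buf.getvalue()  # write this to disk, then `import delimited` in Stata
```

The same loop scales to batch processing: append one list of records per document, then write a single CSV for the whole corpus.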
  7. By: Sheresheva, M.Y. (Lomonosov Moscow State University, Leninskie Gory 1-46, 119991, Moscow, Russia); Gorlacheva, E.N. (Bauman Moscow State Technical University, 2nd Baumanskaya st. 5, 105005, Moscow, Russia)
    Abstract: Objective - Managing innovation activity requires reliable sources of scientific and technical information, including patent research. However, the existing variety and scale of patent databases necessitate the development of an instrument that enables processing large volumes of patent information within limited timeframes. Under these conditions, machine learning (ML) technology is needed to create a solid information base for management decisions. Methodology - The objective of the study presented in the paper was to propose an algorithm for processing patent data to improve the quality of patent research. The essence of the algorithm is that all candidate patents are ranked according to a relevance criterion, after which the researcher analyzes only the highest-ranked patents. Findings - The paper demonstrates the algorithm's practical realization using a gravity-driven power generator case. Findings indicate that the proposed instrument enables a significant reduction in processing time for patent data. Novelty - The paper contributes to innovation management by integrating patent analytics and machine learning. Type of Paper - Empirical
    Keywords: Innovation activity; patent analytics; machine learning technology; gravity-driven power generator.
    JEL: D80 D81
    Date: 2025–12–31
    URL: https://d.repec.org/n?u=RePEc:gtr:gatrjs:jber267
  8. By: Arieda Muço
    Abstract: Using Brazilian municipal audit reports, I construct an automated corruption index that combines a dictionary of audit irregularities with principal component analysis. The index validates strongly against independent human coders, explaining 71–73% of the variation in hand-coded corruption counts in samples where coders themselves exhibit high agreement, and the results are robust within these validation samples. The index behaves as theory predicts, correlating with municipal characteristics that prior research links to corruption. Supervised learning alternatives yield nearly identical municipal rankings (R² = 0.98), confirming that the dictionary approach captures the same underlying construct. The method scales to the full audit corpus and offers advantages over both manual coding and Large Language Models (LLMs) in transparency, cost, and long-run replicability.
    Date: 2025–12
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2512.09652
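The two-stage construction of such an index, dictionary term counts compressed into a first principal component, can be sketched end to end. The dictionary, the toy reports, and the power-iteration PCA below are all illustrative, not the paper's actual lexicon or data:

```python
# Sketch of a dictionary-plus-PCA corruption index: (1) count dictionary
# terms per report, (2) standardize, (3) take the first principal component
# via power iteration as the scalar index. Everything here is illustrative.
import math

DICTIONARY = ["fraud", "overpricing", "missing funds", "irregular bidding"]

reports = [
    "audit found fraud and overpricing in procurement",
    "records complete, no irregularities noted",
    "missing funds and irregular bidding; fraud suspected",
]

# Step 1: term-count matrix (reports x dictionary terms).
X = [[report.count(term) for term in DICTIONARY] for report in reports]

# Step 2: standardize columns (guard against zero-variance columns).
n, p = len(X), len(DICTIONARY)
means = [sum(row[j] for row in X) / n for j in range(p)]
sds = [math.sqrt(sum((row[j] - means[j]) ** 2 for row in X) / n) or 1.0
       for j in range(p)]
Z = [[(row[j] - means[j]) / sds[j] for j in range(p)] for row in X]

# Step 3: leading eigenvector of the correlation matrix by power iteration.
C = [[sum(Z[k][i] * Z[k][j] for k in range(n)) / n for j in range(p)]
     for i in range(p)]
w = [1.0] * p
for _ in range(100):
    w = [sum(C[i][j] * w[j] for j in range(p)) for i in range(p)]
    norm = math.sqrt(sum(x * x for x in w))
    w = [x / norm for x in w]

# The index: each report's score on the first principal component.
index = [sum(Z[i][j] * w[j] for j in range(p)) for i in range(n)]
```

Because the inputs are standardized, the scores are centered at zero; reports loading heavily on irregularity terms sit at one extreme of the component.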
  9. By: Zheng, Maoyong; Escalante, Cesar L.
    Keywords: Agricultural Finance, Farm Management, Agribusiness
    Date: 2024
    URL: https://d.repec.org/n?u=RePEc:ags:aaea24:343857
  10. By: Mohammad Rezoanul Hoque; Md Meftahul Ferdaus; M. Kabir Hassan
    Abstract: Reinforcement learning (RL) is an innovative approach to financial decision making, offering specialized solutions to complex investment problems where traditional methods fail. This review analyzes 167 articles from 2017–2025, focusing on market making, portfolio optimization, and algorithmic trading. It identifies key performance issues and challenges in RL for finance. Generally, RL offers advantages over traditional methods, particularly in market making. This study proposes a unified framework to address common concerns such as explainability, robustness, and deployment feasibility. Empirical evidence with synthetic data suggests that implementation quality and domain knowledge often outweigh algorithmic complexity. The study highlights the need for interpretable RL architectures for regulatory compliance, enhanced robustness in nonstationary environments, and standardized benchmarking protocols. Organizations should focus less on algorithm sophistication and more on market microstructure, regulatory constraints, and risk management in decision-making.
    Date: 2025–12
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2512.10913
  11. By: Arief Rahman (Institut Teknologi Sepuluh Nopember, Surabaya, Indonesia); Sri Gunani Partiwi; Ratna Sari Dewi; Sri Rachmi Dewi (Department of Industrial and Systems Engineering, Institut Teknologi Sepuluh Nopember, 60111, Surabaya, Indonesia)
    Abstract: Objective - This research aimed to evaluate the effectiveness of university culture dissemination by introducing cultural entropy as an objective metric and by modeling its evolution under agent interactions in social networks. The analysis served as an empirically calibrated test of the dynamics of culture dissemination using a simulation framework. Methodology/Technique - Employees from five university divisions participated in an organizational culture survey. This survey was used to create individual cultural profiles. A modified Axelrod model was adopted to simulate the spread of cultural values in an agent-based environment. Furthermore, division-level cultural profiles were mapped using entropy. Forward projections were also used to assess entropy reduction and reveal polarization patterns. Findings - The estimated university-wide cultural entropy was 1.52, indicating significant dispersion in cultural values. Simulations showed that two divisions could experience rapid reductions in entropy, suggesting faster cultural convergence, while three divisions showed continued high entropy or slow improvement. Polarization analysis identified that certain common cultural values became more dominant while others weakened, since cultural unification progressed unevenly across divisions. Novelty - This research is novel in three ways: (i) the use of an Axelrod-based model calibrated with higher-education survey data; (ii) the introduction of cultural entropy as a simple metric for tracking dissemination at the university and division levels; and (iii) the use of real survey results to set up an agent-based model that connects actual culture to simulated interactions. Type of Paper - Empirical
    Keywords: organizational culture; Axelrod model; agent-based simulation; cultural entropy; higher education.
    JEL: C63 D8 I23 L16 Z13
    Date: 2025–12–31
    URL: https://d.repec.org/n?u=RePEc:gtr:gatrjs:jmmr355
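The cultural-entropy metric in this abstract can be sketched as Shannon entropy over the empirical distribution of cultural profiles in a division, together with one Axelrod-style update that homogenizes profiles and lowers entropy. The profiles and the single update below are illustrative, not the paper's survey data:

```python
# Sketch: Shannon entropy of a division's cultural-profile distribution,
# plus one Axelrod-style trait-copying step. Profiles are illustrative.
import math
from collections import Counter

def cultural_entropy(profiles):
    """Shannon entropy (base 2) of the empirical profile distribution."""
    counts = Counter(profiles)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Each tuple is one employee's profile over two cultural dimensions.
division = [("innov", "team"), ("innov", "team"), ("order", "solo"),
            ("innov", "solo")]
h_before = cultural_entropy(division)

# One Axelrod-style step: an agent copies a trait from a similar neighbour,
# so profiles homogenize and entropy can only fall or stay constant.
division[2] = ("innov", "solo")
h_after = cultural_entropy(division)
```

Tracking this quantity per division over simulated interactions gives exactly the kind of forward entropy projection the abstract describes.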
  12. By: Slonimczyk, Fabian
    Abstract: I propose a relatively simple way to deploy pre-trained large language models (LLMs) in order to extract sentiment and other useful features from text data. The method, which I refer to as prompt-based sentiment extraction, offers multiple advantages over other methods used in economics and finance. In particular, it accepts the text input as is (without preprocessing) and produces a sentiment score that has a probability interpretation. Unlike other LLM-based approaches, it does not require any fine-tuning or labeled data. I apply my prompt-based strategy to a hand-collected corpus of confidential reference letters (RLs). I show that the sentiment contents of RLs are clearly reflected in job market outcomes. Candidates with higher average sentiment in their RLs perform markedly better regardless of the measure of success chosen. Moreover, I show that sentiment dispersion among letter writers negatively affects the job market candidate’s performance. I compare my sentiment extraction approach to other commonly used methods for sentiment analysis: ‘bag-of-words’ approaches, fine-tuned language models, and querying advanced chatbots. No other method can fully reproduce the results obtained by prompt-based sentiment extraction. Finally, I slightly modify the method to obtain ‘gendered’ sentiment scores (as in Eberhardt et al., 2023). I show that RLs written for female candidates emphasize ‘grindstone’ personality traits, whereas male candidates’ letters emphasize ‘standout’ traits. These gender differences negatively affect women’s job market outcomes.
    Keywords: Large language models; text data; sentiment analysis; reference letters
    JEL: C45 J16 M51
    Date: 2025–10
    URL: https://d.repec.org/n?u=RePEc:pra:mprapa:126675
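The probability interpretation of the sentiment score in this abstract can be sketched with made-up masked-token logits. A real implementation would query a masked language model for the [MASK] distribution in "This candidate is [MASK]."; the positive and negative token sets and the logit values below are illustrative assumptions:

```python
# Sketch of prompt-based sentiment scoring: renormalize the MLM's [MASK]
# probability mass over positive vs. negative fill-in tokens. Token sets
# and logits are made up; a real run would use an MLM such as BERT.
import math

POSITIVE = {"outstanding", "excellent"}
NEGATIVE = {"weak", "average"}

def sentiment_score(mask_logits: dict) -> float:
    """P(positive token) / (P(positive) + P(negative)) under a softmax."""
    z = max(mask_logits.values())
    probs = {t: math.exp(v - z) for t, v in mask_logits.items()}
    total = sum(probs.values())
    probs = {t: p_ / total for t, p_ in probs.items()}
    pos = sum(probs.get(t, 0.0) for t in POSITIVE)
    neg = sum(probs.get(t, 0.0) for t in NEGATIVE)
    return pos / (pos + neg)

score = sentiment_score({"outstanding": 3.1, "excellent": 2.4,
                         "weak": 0.2, "average": 1.0})
```

Because the score is a renormalized probability it lies in (0, 1), which is what gives it the probability interpretation the abstract highlights.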
  13. By: César Anzola Bravo; Paola Poveda
    Abstract: Food prices have consistently been one of the leading contributors to Colombia’s inflation rate. They are particularly sensitive to exogenous factors such as extreme weather events, supply chain disruptions, and global commodity price shocks, often resulting in sharp and unpredictable price fluctuations. This document pursues two main objectives. First, it aims to estimate and evaluate methods for forecasting 33 homogeneous food inflation baskets, which together constitute the total food Consumer Price Index (Food CPI), offering tools that can assist policymakers in anticipating the drivers of future inflation. This includes both traditional time series models and modern machine learning approaches. Second, it seeks to enhance the interpretability of model predictions through explainable AI techniques. To achieve this, we propose a variable lag selection algorithm to identify optimal feature-lag pairs, and employ SHAP (SHapley Additive exPlanations) values to quantify the contribution of each feature to the model’s forecast. Our findings indicate that machine learning models outperform traditional approaches in forecasting food inflation, delivering improved accuracy across most individual baskets as well as for aggregated food inflation.
    Keywords: Macroeconomic Forecasts, Food Prices, Machine learning, Pronóstico Macroeconómico, Inflación de alimentos
    JEL: C53 E31 E37
    Date: 2025–12
    URL: https://d.repec.org/n?u=RePEc:bdr:borrec:1335
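The feature-lag construction that precedes model fitting in this abstract's variable-lag selection step can be sketched simply: each candidate series is expanded into several lagged copies, and the model (XGBoost in the paper) then selects among them. The series names, values, and lag grid below are illustrative:

```python
# Sketch of feature-lag expansion for forecasting: each candidate series
# becomes several lagged columns aligned to the target index. Series names
# and the lag grid are illustrative, not the paper's data.

def make_lagged_features(series: dict, lags: range):
    """Return rows of {feature_lagK: value}, one row per usable time step."""
    max_lag = max(lags)
    length = min(len(v) for v in series.values())
    rows = []
    for t in range(max_lag, length):
        row = {}
        for name, values in series.items():
            for k in lags:
                row[f"{name}_lag{k}"] = values[t - k]
        rows.append(row)
    return rows

series = {"rainfall": [3, 5, 2, 8, 6, 4],
          "fx_rate": [1.0, 1.1, 1.2, 1.15, 1.3, 1.25]}
rows = make_lagged_features(series, range(1, 3))
```

On top of such a matrix, a lag-selection routine keeps only the feature-lag pairs that improve held-out accuracy, and SHAP values then attribute each forecast to the surviving columns.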
  14. By: Paul Glasserman; Siddharth Hemant Karmarkar
    Abstract: Differential ML (Huge and Savine 2020) is a technique for training neural networks to provide fast approximations to complex simulation-based models for derivatives pricing and risk management. It uses price sensitivities calculated through pathwise adjoint differentiation to reduce pricing and hedging errors. However, for options with discontinuous payoffs, such as digital or barrier options, the pathwise sensitivities are biased, and incorporating them into the loss function can magnify errors. We consider alternative methods for estimating sensitivities and find that they can substantially reduce test errors in prices and in their sensitivities. Using differential labels calculated through the likelihood ratio method expands the scope of Differential ML to discontinuous payoffs. A hybrid method incorporates gamma estimates as well as delta estimates, providing further regularization.
    Date: 2025–12
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2512.05301
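The abstract's central contrast, that pathwise deltas fail for digital payoffs while likelihood-ratio (LR) deltas remain unbiased, can be illustrated with a textbook LR estimator for a digital call under Black-Scholes dynamics. This is a generic sketch under standard assumptions, not the authors' implementation, and the parameters are arbitrary:

```python
# LR (score-function) delta for a digital call under geometric Brownian
# motion. The pathwise derivative of the indicator payoff is 0 almost
# surely, so a pathwise estimator would return 0 here; the LR weight
# z / (s0 * sigma * sqrt(t)) recovers a positive, unbiased delta.
import math, random

def lr_delta_digital(s0, k, r, sigma, t, n_paths, seed=0):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_paths):
        z = rng.gauss(0.0, 1.0)
        s_t = s0 * math.exp((r - 0.5 * sigma**2) * t
                            + sigma * math.sqrt(t) * z)
        payoff = 1.0 if s_t > k else 0.0
        # LR weight: score of the terminal density w.r.t. the spot s0.
        total += payoff * z / (s0 * sigma * math.sqrt(t))
    return math.exp(-r * t) * total / n_paths

delta = lr_delta_digital(s0=100, k=100, r=0.01, sigma=0.2, t=1.0,
                         n_paths=50_000)
```

Using such LR estimates as the differential labels is what the paper describes as extending Differential ML to discontinuous payoffs; the analytic delta for these parameters is roughly 0.02, which the Monte Carlo estimate approaches.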
  15. By: Andrzej Tokajuk; Jarosław A. Chudziak
    Abstract: Forecasting cryptocurrency prices is hindered by extreme volatility and a methodological dilemma between information-scarce univariate models and noise-prone full-multivariate models. This paper investigates a partial-multivariate approach to balance this trade-off, hypothesizing that a strategic subset of features offers superior predictive power. We apply the Partial-Multivariate Transformer (PMformer) to forecast daily returns for BTCUSDT and ETHUSDT, benchmarking it against eleven classical and deep learning models. Our empirical results yield two primary contributions. First, we demonstrate that the partial-multivariate strategy achieves significant statistical accuracy, effectively balancing informative signals with noise. Second, we document and discuss an observable disconnect between this statistical performance and practical trading utility: lower prediction error did not consistently translate to higher financial returns in simulations. This finding challenges the reliance on traditional error metrics and highlights the need to develop evaluation criteria more aligned with real-world financial objectives.
    Date: 2025–11
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2512.04099
  16. By: Tetsuya Takaishi
    Abstract: We employ single-qubit quantum circuit learning (QCL) to model the dynamics of volatility time series. To assess its effectiveness, we generate synthetic data using the Rational GARCH model, which is specifically designed to capture volatility asymmetry. Our results show that QCL-based volatility predictions preserve the negative return-volatility correlation, a hallmark of asymmetric volatility dynamics. Moreover, analysis of the Hurst exponent and multifractal characteristics indicates that the predicted series, like the original synthetic data, exhibits anti-persistent behavior and retains its multifractal structure.
    Date: 2025–12
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2512.10584
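The single-qubit quantum circuit learning setup in this abstract admits a compact classical sketch: encode an input as a rotation angle, apply a trainable rotation, and read out the Pauli-Z expectation, which for composed Ry rotations on |0> has a closed form. The encoding, angle grid, and toy "training" step below are illustrative, not the paper's model:

```python
# Sketch of single-qubit QCL: Ry(theta) . Ry(x) |0> gives <Z> = cos(x + theta),
# so prediction and a one-parameter fit can be written in closed form.
# Encoding and the grid search are illustrative assumptions.
import math

def qcl_predict(x: float, theta: float) -> float:
    """<Z> after Ry(x) then Ry(theta) applied to |0> equals cos(x + theta)."""
    return math.cos(x + theta)

# A tiny "training" pass: grid-search theta to fit one (x, target) sample.
target, x = 0.5, 0.3
best_theta = min((t / 100 for t in range(-314, 315)),
                 key=lambda th: (qcl_predict(x, th) - target) ** 2)
```

In the paper the readout feeds a volatility forecast; here the closed form just shows why a single qubit realizes a trainable sinusoidal regressor.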
  17. By: W Bentley MacLeod (Cowles Foundation for Research in Economics, Yale University)
    Abstract: This paper outlines an economic model that provides a framework for organising the growing literature on the performance of physicians and judges. The primary task of these professionals is to make decisions based on the information provided by their clients. The paper discusses professional decisions in terms of what Kahneman (2011) calls fast and slow decisions, known as System 1 and System 2 in cognitive science. Slow decisions correspond to the economist's model of rational choice, while System 1 (fast) decisions are high-speed, intuitive choices guided by training and human capital. This distinction is used to provide a model of decision-making under uncertainty based on Bewley's (2011) theory of Knightian uncertainty to show that human values are an essential input to optimal choice. This, in turn, provides conditions under which artificial intelligence (AI) tools can assist professional decision-making, while pointing to cases where such tools need to explicitly incorporate human values in order to make better decisions.
    Date: 2025–12–01
    URL: https://d.repec.org/n?u=RePEc:cwl:cwldpp:2475
  18. By: Yuichi IKEDA; Hideaki AOYAMA; Tetsuo HATSUDA; Tomoyuki SHIRAI; Taro HASUI; Yoshimasa HIDAKA; Krongtum SANKAEWTONG; Hiroshi IYETOMI; Yuta YARAI; Abhijit CHAKRABORTY; Yasushi NAKAYAMA; Akihiro FUJIHARA; Pierluigi CESANA; Wataru SOUMA
    Abstract: This study proposes an artificial intelligence framework to detect price surges in crypto assets by leveraging network features extracted from transaction data. Motivated by the challenges in Anti-Money Laundering, Countering the Financing of Terrorism, and Counter-Proliferation Financing, we focus on structural features within crypto asset networks that may precede extreme market events. Building on theories from complex network analysis and rate-induced tipping, we characterize early warning signals. Granger causality is applied for feature selection, identifying network dynamics that causally precede price movements. To quantify surge likelihood, we employ a Boltzmann machine as a generative model to derive nonlinear indicators that are sensitive to critical shifts in transactional topology. Furthermore, we develop a method to trace back and identify individual nodes that contribute significantly to price surges. The findings have practical implications for investors, risk management officers, regulatory supervision by financial authorities, and the evaluation of systemic risk. This framework presents a novel approach to integrating explainable AI, financial network theory, and regulatory objectives in crypto asset markets.
    Date: 2025–12
    URL: https://d.repec.org/n?u=RePEc:eti:dpaper:25113
  19. By: Carlo Drago (Università degli Studi Niccolò Cusano)
    Abstract: This presentation shows an applied framework for text mining and clustering in the Stata environment and provides practical tools for policy-relevant research in economics and health economics. With the growing amount of unstructured textual data—from financial news and analyst reports to scientific publications—there is an increasing demand for scalable methods to classify and interpret such information for evidence-based policy and forecasting. A first relevant feature is Stata's capacity to integrate with Python, used here to implement hierarchical clustering from scratch using TF-IDF vectorization and cosine distance. This technique is specifically applied to economic text sources—such as headlines or institutional communications—with the aim of segmenting documents into a fixed or silhouette-optimized number of clusters. This approach allows researchers to identify patterns in data, uncover latent themes, and organize information for macroeconomic forecasting, sentiment analysis, or real-time policy monitoring. In the second part, I focus on literature mapping in health economics. Using a curated corpus of article titles related to telemedicine and diabetes, I apply a native Stata pipeline based on text normalization and clustering to identify thematic areas within the literature. The approach promotes organized reviews in health technology assessment and policy evaluation and makes evidence synthesis more accessible. By combining native Stata capabilities with Python-enhanced workflows, I provide applied researchers with an accessible and policy-relevant toolkit for unsupervised text classification in multiple domains.
    Date: 2025–10–01
    URL: https://d.repec.org/n?u=RePEc:boc:isug25:14
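The TF-IDF and cosine-distance core of the clustering pipeline described above can be sketched from scratch, as the talk does via Stata's Python integration. The three toy documents are illustrative; the resulting pairwise distances are what a hierarchical clustering routine would merge on:

```python
# Sketch of TF-IDF vectorization plus cosine distance, the inputs to
# hierarchical clustering. Documents are illustrative toy headlines.
import math
from collections import Counter

docs = ["inflation forecast food prices",
        "telemedicine diabetes care access",
        "food inflation policy monitoring"]

tokenized = [d.split() for d in docs]
vocab = sorted({w for toks in tokenized for w in toks})
n = len(docs)

def tfidf_vector(tokens):
    tf = Counter(tokens)
    return [tf[w] / len(tokens)
            * math.log(n / sum(1 for t in tokenized if w in t))
            for w in vocab]

vectors = [tfidf_vector(t) for t in tokenized]

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - (dot / (na * nb) if na and nb else 0.0)

# Pairwise distances feed the hierarchical clustering step; the two
# inflation-related documents should sit closest together.
d01 = cosine_distance(vectors[0], vectors[1])
d02 = cosine_distance(vectors[0], vectors[2])
```

With the full distance matrix in hand, an agglomerative routine (fixed-k or silhouette-optimized, as in the talk) repeatedly merges the closest pair of clusters.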

This nep-cmp issue is ©2025 by Stan Miles. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at https://nep.repec.org. For comments please write to the director of NEP, Marco Novarese at <director@nep.repec.org>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.