nep-big New Economics Papers
on Big Data
Issue of 2025–12–22
nine papers chosen by
Tom Coupé, University of Canterbury


  1. Integration of LSTM Networks in Random Forest Algorithms for Stock Market Trading Predictions By Juan C. King; Jose M. Amigo
  2. The Elaboration of the Patent Processing Instrument Based on Machine Learning Technology By Sheresheva, M.Y.
  3. This Candidate is [MASK]. Prompt-based Sentiment Extraction and Reference Letters By Slonimczyk, Fabian
  4. Forecasting Disaggregated Food Inflation Baskets in Colombia with an XGBoost Model By César Anzola Bravo; Paola Poveda
  5. Measuring Corruption from Text Data By Arieda Muço
  6. Automated data extraction from unstructured text using LLMs: A scalable workflow for Stata users By Loreta Isaraj
  7. Responsible LLM Deployment for High-Stake Decisions by Decentralized Technologies and Human-AI Interactions By Swati Sachan; Theo Miller; Mai Phuong Nguyen
  8. Text mining and hierarchical clustering in Stata: An applied approach for real-time policy monitoring, forecasting, and literature mapping. By Carlo Drago
  9. Job Satisfaction Through the Lens of Social Media: Rural–Urban Patterns in the U.S. By Stefano M Iacus; Giuseppe Porro

  1. By: Juan C. King; Jose M. Amigo
    Abstract: The aim of this paper is the analysis and selection of stock trading systems that combine different models with data of different nature, such as financial and microeconomic information. Specifically, based on previous work by the authors and applying advanced techniques of Machine Learning and Deep Learning, our objective is to formulate trading algorithms for the stock market with empirically tested statistical advantages, thus improving results published in the literature. Our approach integrates Long Short-Term Memory (LSTM) networks with algorithms based on decision trees, such as Random Forest and Gradient Boosting. While the former analyze price patterns of financial assets, the latter are fed with economic data of companies. Numerical simulations of algorithmic trading with data from international companies and 10-weekday predictions confirm that an approach based on both fundamental and technical variables can outperform the usual approaches, which do not combine those two types of variables. In doing so, Random Forest turned out to be the best performer among the decision trees. We also discuss how the prediction performance of such a hybrid approach can be boosted by selecting the technical variables.
    Date: 2025–11
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2512.02036
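The hybrid idea, combining a price-pattern model's signal with a fundamentals model's signal, can be sketched in a few lines. This is a minimal pure-Python illustration with toy stand-in predictors; it is not the authors' LSTM or Random Forest implementation, and all functions and numbers here are invented for the example:

```python
# Minimal sketch: combine a price-pattern model's signal with a
# fundamentals model's signal by averaging their class probabilities.
# Both stand-in predictors are illustrative, not the paper's LSTM/RF.

def technical_signal(prices):
    """Toy stand-in for the LSTM: momentum over the price window."""
    ret = (prices[-1] - prices[0]) / prices[0]
    # Squash the return into a pseudo-probability of an "up" move.
    return min(max(0.5 + 5.0 * ret, 0.0), 1.0)

def fundamental_signal(features, weights):
    """Toy stand-in for the Random Forest: weighted score of company data."""
    score = sum(w * f for w, f in zip(weights, features))
    return min(max(score, 0.0), 1.0)

def hybrid_prediction(prices, features, weights, threshold=0.5):
    """Average both probability streams and emit a long/flat decision."""
    p = 0.5 * (technical_signal(prices) + fundamental_signal(features, weights))
    return ("long" if p >= threshold else "flat"), p

decision, p = hybrid_prediction([100, 101, 103, 104], [0.8, 0.6], [0.7, 0.5])
```

The point of the sketch is the combination step: each information source produces its own probability, and the trading rule is applied to the blended score rather than to either model alone.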
  2. By: Sheresheva, M.Y. (Lomonosov Moscow State University, Leninskie Gory 1-46, 119991, Moscow, Russia); Gorlacheva, E.N. (Bauman Moscow State Technical University, 2nd Baumanskaya st. 5, 105005, Moscow, Russia)
    Abstract: Objective - Managing innovation activity requires reliable sources of scientific and technical information, including patent research. However, the variety and scale of existing patent databases call for an instrument that can process large volumes of patent information within limited timeframes. Under these conditions, machine learning (ML) technology is needed to create a solid information base for management decisions. Methodology - The study proposes an algorithm for processing patent data to improve the quality of patent research. The essence of the algorithm is that all candidate patents are ranked by a relevance criterion, after which the researcher analyzes only the most relevant ones. Findings - The paper demonstrates the algorithm's practical realization on a gravity-driven power generator case. Findings indicate that the proposed instrument significantly reduces the processing time for patent data. Novelty - The paper contributes to innovation management by integrating patent analytics and machine learning. Type of Paper - Empirical
    Keywords: Innovation activity; patent analytics; machine learning technology; gravity-driven power generator
    JEL: D80 D81
    Date: 2025–12–31
    URL: https://d.repec.org/n?u=RePEc:gtr:gatrjs:jber267
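The core of the algorithm, ranking patents by a relevance criterion so the researcher reads only the top of the list, can be sketched simply. A minimal pure-Python illustration; the query terms, weights, and patent records below are invented for the example, not taken from the paper:

```python
# Minimal sketch of the ranking idea: score each patent abstract by
# weighted occurrences of query terms, then rank so the analyst reads
# only the most relevant documents. Terms and weights are illustrative.

def relevance(text, terms):
    """Weighted count of query-term occurrences in a patent abstract."""
    words = text.lower().split()
    return sum(weight * words.count(term) for term, weight in terms.items())

def rank_patents(patents, terms, top_k=2):
    """Return the ids of the top_k patents by descending relevance."""
    scored = sorted(patents, key=lambda p: relevance(p["abstract"], terms),
                    reverse=True)
    return [p["id"] for p in scored[:top_k]]

terms = {"gravity": 2.0, "generator": 1.0, "power": 1.0}
patents = [
    {"id": "P1", "abstract": "A gravity driven power generator with a falling mass"},
    {"id": "P2", "abstract": "A solar panel mounting bracket"},
    {"id": "P3", "abstract": "Gravity battery storing power as potential energy"},
]
top = rank_patents(patents, terms)
```

A production instrument would replace the keyword weights with an ML relevance model, but the workflow shape is the same: score, rank, and hand the analyst a short list.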
  3. By: Slonimczyk, Fabian
    Abstract: I propose a relatively simple way to deploy pre-trained large language models (LLMs) in order to extract sentiment and other useful features from text data. The method, which I refer to as prompt-based sentiment extraction, offers multiple advantages over other methods used in economics and finance. In particular, it accepts the text input as is (without preprocessing) and produces a sentiment score that has a probability interpretation. Unlike other LLM-based approaches, it does not require any fine-tuning or labeled data. I apply my prompt-based strategy to a hand-collected corpus of confidential reference letters (RLs). I show that the sentiment contents of RLs are clearly reflected in job market outcomes. Candidates with higher average sentiment in their RLs perform markedly better regardless of the measure of success chosen. Moreover, I show that sentiment dispersion among letter writers negatively affects the job market candidate’s performance. I compare my sentiment extraction approach to other commonly used methods for sentiment analysis: ‘bag-of-words’ approaches, fine-tuned language models, and querying advanced chatbots. No other method can fully reproduce the results obtained by prompt-based sentiment extraction. Finally, I slightly modify the method to obtain ‘gendered’ sentiment scores (as in Eberhardt et al., 2023). I show that RLs written for female candidates emphasize ‘grindstone’ personality traits, whereas male candidates’ letters emphasize ‘standout’ traits. These gender differences negatively affect women’s job market outcomes.
    Keywords: Large language models; text data; sentiment analysis; reference letters
    JEL: C45 J16 M51
    Date: 2025–10
    URL: https://d.repec.org/n?u=RePEc:pra:mprapa:126675
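The abstract notes that the sentiment score has a probability interpretation. One way that property arises in masked-token setups is by renormalizing the model's probability mass over positive versus negative candidate fill-ins for "[MASK]". A minimal pure-Python sketch of that scoring step; the word lists and log-probabilities are made-up stand-ins for real model output, not the paper's vocabulary:

```python
import math

# Minimal sketch of the scoring step: given an LLM's log-probabilities
# for candidate words at the "[MASK]" position, renormalize mass over
# positive vs. negative word lists to get a sentiment score in [0, 1].
# The word lists and log-probabilities below are illustrative stand-ins.

POSITIVE = {"excellent", "outstanding"}
NEGATIVE = {"weak", "mediocre"}

def sentiment_score(mask_logprobs):
    """P(positive) restricted to the sentiment vocabulary."""
    pos = sum(math.exp(lp) for w, lp in mask_logprobs.items() if w in POSITIVE)
    neg = sum(math.exp(lp) for w, lp in mask_logprobs.items() if w in NEGATIVE)
    return pos / (pos + neg)

fake_logprobs = {"excellent": -1.0, "outstanding": -2.0,
                 "weak": -3.0, "mediocre": -4.0}
score = sentiment_score(fake_logprobs)
```

Because the score is a ratio of probability masses, it lies in [0, 1] by construction, which is what gives it the probability interpretation the abstract emphasizes.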
  4. By: César Anzola Bravo; Paola Poveda
    Abstract: Food prices have consistently been one of the leading contributors to Colombia’s inflation rate. They are particularly sensitive to exogenous factors such as extreme weather events, supply chain disruptions, and global commodity price shocks, often resulting in sharp and unpredictable price fluctuations. This document pursues two main objectives. First, it aims to estimate and evaluate methods for forecasting 33 homogeneous food inflation baskets, which together constitute the total food Consumer Price Index (Food CPI), offering tools that can assist policymakers in anticipating the drivers of future inflation. This includes both traditional time series models and modern machine learning approaches. Second, it seeks to enhance the interpretability of model predictions through explainable AI techniques. To achieve this, we propose a variable lag selection algorithm to identify optimal feature-lag pairs, and employ SHAP (SHapley Additive exPlanations) values to quantify the contribution of each feature to the model’s forecast. Our findings indicate that machine learning models outperform traditional approaches in forecasting food inflation, delivering improved accuracy across most individual baskets as well as for aggregated food inflation.
    Keywords: Macroeconomic Forecasts, Food Prices, Machine Learning
    JEL: C53 E31 E37
    Date: 2025–12
    URL: https://d.repec.org/n?u=RePEc:bdr:borrec:1335
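The variable lag selection idea, searching over candidate lags of a predictor and keeping the feature-lag pair that best tracks the target, can be sketched with a simple correlation criterion. This is an illustrative pure-Python stand-in for the paper's algorithm, not its exact procedure, and the toy series are invented:

```python
# Minimal sketch of a feature-lag search: for each candidate lag, score
# the lagged feature against the inflation target by absolute Pearson
# correlation and keep the best-scoring lag. Illustrative stand-in only.

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy)

def best_lag(feature, target, max_lag=3):
    """Return (lag, |corr|) maximizing correlation of lagged feature with target."""
    best = (0, 0.0)
    for lag in range(1, max_lag + 1):
        x, y = feature[:-lag], target[lag:]
        r = abs(pearson(x, y))
        if r > best[1]:
            best = (lag, r)
    return best

# Toy series: the target simply repeats the feature shifted by two periods.
feature = [1.0, 2.0, 1.5, 3.0, 2.5, 4.0, 3.5, 5.0]
target = [0.0, 0.0] + feature[:-2]
lag, r = best_lag(feature, target)
```

In a real forecasting pipeline, the correlation criterion would typically be replaced by out-of-sample forecast error, and the selected pairs would feed the XGBoost model whose contributions SHAP values then decompose.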
  5. By: Arieda Muço
    Abstract: Using Brazilian municipal audit reports, I construct an automated corruption index that combines a dictionary of audit irregularities with principal component analysis. The index validates strongly against independent human coders, explaining 71–73% of the variation in hand-coded corruption counts in samples where coders themselves exhibit high agreement, and the results are robust within these validation samples. The index behaves as theory predicts, correlating with municipal characteristics that prior research links to corruption. Supervised learning alternatives yield nearly identical municipal rankings ($R^{2}=0.98$), confirming that the dictionary approach captures the same underlying construct. The method scales to the full audit corpus and offers advantages over both manual coding and Large Language Models (LLMs) in transparency, cost, and long-run replicability.
    Date: 2025–12
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2512.09652
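The two ingredients named in the abstract, a dictionary of irregularity terms and principal component analysis, combine naturally: count dictionary hits per report, standardize the counts, and take the first principal component as the index. A minimal pure-Python sketch; the dictionary categories, terms, and reports are invented for illustration and are not the paper's dictionary:

```python
# Minimal sketch of the index construction: count dictionary hits per
# audit report, standardize the counts, and combine them into one index
# via the first principal component (power iteration on the covariance
# matrix). Dictionary terms and reports are illustrative stand-ins.

DICTIONARY = {"fraud": ["fraud", "embezzlement"],
              "procurement": ["overpriced", "no-bid"]}

def count_hits(text):
    words = text.lower().split()
    return [sum(words.count(t) for t in terms) for terms in DICTIONARY.values()]

def standardize(col):
    n = len(col)
    m = sum(col) / n
    s = (sum((v - m) ** 2 for v in col) / n) ** 0.5 or 1.0
    return [(v - m) / s for v in col]

def first_pc(rows, iters=100):
    """Leading eigenvector of the covariance matrix via power iteration."""
    k = len(rows[0])
    cov = [[sum(r[i] * r[j] for r in rows) / len(rows) for j in range(k)]
           for i in range(k)]
    v = [1.0] * k
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(k)) for i in range(k)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

reports = [
    "routine audit no findings",
    "fraud in procurement overpriced no-bid contract",
    "embezzlement and fraud with overpriced purchases",
]
counts = [count_hits(r) for r in reports]
rows = [list(r) for r in zip(*[standardize(c) for c in zip(*counts)])]
pc = first_pc(rows)
index = [sum(a * b for a, b in zip(row, pc)) for row in rows]
```

The clean report receives a low index value and the irregularity-laden reports receive high ones, which is the ordering property a corruption index needs before any validation against human coders.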
  6. By: Loreta Isaraj (IRCrES-CNR)
    Abstract: In several data-rich domains such as finance, medicine, law, and scientific publishing, most of the valuable information is embedded in unstructured textual formats, from clinical notes and legal briefs to financial statements and research papers. These sources are rarely available in structured formats suitable for immediate quantitative analysis. This presentation introduces a scalable and fully integrated workflow that employs large language models (LLMs), specifically ChatGPT 4.0 via API, in conjunction with Python and Stata to extract structured variables from unstructured documents and make them ready for further statistical processing in Stata. As a representative use case, I demonstrate the extraction of information from a SOAP clinical note, treated as a typical example of unstructured medical documentation. The process begins with a single PDF and extends to an automated pipeline capable of batch-processing multiple documents, highlighting the scalability of this approach. The workflow involves PDF parsing and text preprocessing using Python, followed by prompt engineering designed to optimize the performance of the LLM. In particular, the temperature parameter is tuned to a low value (for example, 0.0–0.3) to promote deterministic and concise extraction, minimizing variation across similar documents and ensuring consistency in output structure. Once the LLM returns structured data, typically in JSON or CSV format, it is seamlessly imported into Stata using custom .do scripts that handle parsing (insheet), transformation (split, reshape), and data cleaning. The final dataset is used for exploratory or inferential analysis, with visualization and summary statistics executed entirely within Stata.
The presentation also addresses critical considerations including the computational cost of using commercial LLM APIs (token-based billing), privacy and compliance risks when processing sensitive data (such as patient records), and the potential for bias or hallucination inherent to generative models. To assess the reliability of the extraction process, I report evaluation metrics such as cosine similarity (for text alignment and summarization accuracy) and F1-score (for evaluating named entity and numerical field extraction). By bridging the capabilities of LLMs with Stata’s powerful analysis tools, this workflow equips researchers and analysts with an accessible method to unlock structured insights from complex unstructured sources, extending the reach of empirical research into previously inaccessible text-heavy datasets.
    Date: 2025–10–01
    URL: https://d.repec.org/n?u=RePEc:boc:isug25:13
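The hand-off step in such a workflow, turning the LLM's JSON reply into a flat CSV that Stata can read, is the part that lends itself to a short sketch. A minimal pure-Python illustration; the JSON reply below is a made-up example of extracted SOAP-note fields, not output from the presenter's pipeline:

```python
import csv, io, json

# Minimal sketch of the post-LLM step: the model's JSON reply (a made-up
# example of extracted SOAP-note fields) is flattened into a CSV that
# Stata can then read with insheet / import delimited.

llm_reply = '''
{"patient_id": "A-102", "temperature_c": 38.4,
 "diagnosis": "sinusitis", "medications": ["amoxicillin", "ibuprofen"]}
'''

def to_stata_csv(json_text):
    """Flatten one extracted record: lists become semicolon-joined strings."""
    record = json.loads(json_text)
    flat = {k: ";".join(v) if isinstance(v, list) else v
            for k, v in record.items()}
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(flat))
    writer.writeheader()
    writer.writerow(flat)
    return buf.getvalue()

csv_text = to_stata_csv(llm_reply)
```

Joining list fields into a single delimited string keeps the CSV rectangular; on the Stata side, split can then break the medications column back into separate variables.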
  7. By: Swati Sachan; Theo Miller; Mai Phuong Nguyen
    Abstract: High-stakes decision domains are increasingly exploring the potential of Large Language Models (LLMs) for complex decision-making tasks. However, LLM deployment in real-world settings presents challenges in data security, evaluation of model capabilities outside controlled environments, and accountability attribution in the event of adversarial decisions. This paper proposes a framework for responsible deployment of LLM-based decision-support systems through active human involvement. It integrates interactive collaboration between human experts and developers across multiple pre-deployment iterations to assess uncertain samples and judge the stability of the explanations provided by post-hoc XAI techniques. Local LLM deployment within organizations and decentralized technologies, such as Blockchain and IPFS, are proposed to create immutable records of LLM activities for automated auditing, enhancing security and enabling accountability to be traced. The framework was tested on BERT-large-uncased, Mistral, and LLaMA 2 and 3 models to assess its capability to support responsible financial decisions on business lending.
    Date: 2025–11
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2512.04108
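The immutability property behind the proposed audit trail can be illustrated with a hash chain: each record commits to the hash of its predecessor, so any later edit breaks verification. This is a minimal pure-Python sketch of that property, not the paper's actual Blockchain/IPFS stack, and the logged payloads are invented:

```python
import hashlib, json

# Minimal sketch of the audit-trail idea: each LLM interaction record is
# chained to the previous one by hash, so later tampering breaks
# verification. Illustrates immutability, not the paper's actual stack.

def append_record(chain, payload):
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    body = json.dumps({"prev": prev_hash, "payload": payload}, sort_keys=True)
    chain.append({"prev": prev_hash, "payload": payload,
                  "hash": hashlib.sha256(body.encode()).hexdigest()})

def verify(chain):
    prev_hash = "0" * 64
    for rec in chain:
        body = json.dumps({"prev": prev_hash, "payload": rec["payload"]},
                          sort_keys=True)
        if rec["prev"] != prev_hash or \
           rec["hash"] != hashlib.sha256(body.encode()).hexdigest():
            return False
        prev_hash = rec["hash"]
    return True

log = []
append_record(log, {"model": "llama-3", "decision": "approve", "loan_id": 17})
append_record(log, {"model": "llama-3", "decision": "review", "loan_id": 18})
ok_before = verify(log)
log[0]["payload"]["decision"] = "deny"   # tamper with the first record
ok_after = verify(log)
```

A real deployment would anchor these hashes on a blockchain and store payloads on IPFS, but the accountability argument rests on exactly this chained-hash check.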
  8. By: Carlo Drago (Università degli Studi Niccolò Cusano)
    Abstract: This presentation shows an applied framework for text mining and clustering in the Stata environment and provides practical tools for policy-relevant research in economics and health economics. With the growing amount of unstructured textual data, from financial news and analyst reports to scientific publications, there is an increasing demand for scalable methods to classify and interpret such information for evidence-based policy and forecasting. The first part exploits Stata's ability to integrate with Python to implement hierarchical clustering from scratch using TF-IDF vectorization and cosine distance. This technique is applied to economic text sources, such as headlines or institutional communications, to segment documents into a fixed or silhouette-optimized number of clusters. The approach allows researchers to identify patterns in the data, uncover latent themes, and organize information for macroeconomic forecasting, sentiment analysis, or real-time policy monitoring. In the second part, I focus on literature mapping in health economics. Using a curated corpus of article titles related to telemedicine and diabetes, I apply a native Stata pipeline based on text normalization and clustering to identify thematic areas within the literature. The approach promotes organized reviews in health technology assessment and policy evaluation and makes evidence synthesis more accessible. By combining native Stata capabilities with Python-enhanced workflows, I provide applied researchers with an accessible and policy-relevant toolkit for unsupervised text classification in multiple domains.
    Date: 2025–10–01
    URL: https://d.repec.org/n?u=RePEc:boc:isug25:14
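The Python side of the described workflow, TF-IDF vectorization and a cosine-distance matrix feeding hierarchical clustering, can be written from scratch in a few lines. A minimal pure-Python sketch; the example headlines are invented and the presenter's exact implementation may differ:

```python
import math

# Minimal sketch of the Python side of the workflow: TF-IDF vectors built
# from scratch and a cosine-distance matrix ready for hierarchical
# clustering. The documents are illustrative headlines.

def tfidf(docs):
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({w for t in tokenized for w in t})
    n = len(docs)
    idf = {w: math.log(n / sum(1 for t in tokenized if w in t)) for w in vocab}
    return [[t.count(w) / len(t) * idf[w] for w in vocab] for t in tokenized]

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return 1.0 - (dot / (na * nb) if na and nb else 0.0)

docs = [
    "central bank raises interest rates",
    "bank raises rates to fight inflation",
    "telemedicine improves diabetes care",
]
vectors = tfidf(docs)
dist = [[cosine_distance(a, b) for b in vectors] for a in vectors]
```

The two monetary-policy headlines end up much closer to each other than to the telemedicine headline, which is the separation an agglomerative clustering step then exploits when merging documents into themes.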
  9. By: Stefano M Iacus; Giuseppe Porro
    Abstract: We analyze a novel large-scale social-media-based measure of U.S. job satisfaction, constructed by applying a fine-tuned large language model to 2.6 billion georeferenced tweets, and link it to county-level labor market conditions (2013-2023). Logistic regressions show that rural counties consistently report lower job satisfaction sentiment than urban ones, but this gap decreases under tight labor markets. In contrast to widening rural-urban income disparities, perceived job quality converges when unemployment is low, suggesting that labor market slack, not income alone, drives spatial inequality in subjective work-related well-being.
    Date: 2025–12
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2512.05144

This nep-big issue is ©2025 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at https://nep.repec.org. For comments please write to the director of NEP, Marco Novarese at <director@nep.repec.org>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.