nep-big New Economics Papers
on Big Data
Issue of 2025–11–24
fourteen papers chosen by
Tom Coupé, University of Canterbury


  1. Directional Forecasts for Yields Using Econometric Models and Machine Learning Methods By Sotiris Tsolacos; Tatiana Franus
  2. "It Looks All the Same to Me": Cross-index Training for Long-term Financial Series Prediction By Stanislav Selitskiy
  3. Measuring economic outlook in the news timely and efficiently By Elliot Beck; Franziska Eckert; Linus Kühne; Helge Liebert; Rina Rosenblatt-Wisch
  4. Model selection for inhomogeneous real estate market data in Germany. By Matthias Soot; Sabine Horvath; Danielle Warstat; Hans-Berndt Neuner; Alexandra Weitkamp
  5. Measuring University Contributions to the Sustainable Development Goals: An NLP-Based Assessment Framework By Phoebe Koundouri; Conrad Landis; Stathis Devves; Theofanis Zacharatos; Georgios Feretzakis
  6. Misaligned by Design: Incentive Failures in Machine Learning By David Autor; Andrew Caplin; Daniel Martin; Philip Marx
  7. Predicting House Price Indices: A Machine Learning Approach Using Linked Listing and Transaction Data By Jan Schmid; He Cheng; Francisco Amaral; Jonas Zdrzalek
  8. An extreme Gradient Boosting (XGBoost) Trees approach to Detect and Identify Unlawful Insider Trading (UIT) Transactions By Krishna Neupane; Igor Griva
  9. Information Extraction from Fiscal Documents using LLMs By Vikram Aggarwal; Jay Kulkarni; Aakriti Narang; Aditi Mascarenhas; Siddarth Raman; Ajay Shah; Susan Thomas
  10. Fundamentals or Noise? The Informative Value of Twitter Sentiment for REIT Returns By Lukas Lautenschlaeger; Sophia Bodensteiner; Julia Freybote; Wolfgang Schäfers
  11. Pattern Recognition of Scrap Plastic Misclassification in Global Trade Data By Muhammad Sukri Bin Ramli
  12. Forecasting Macro with Finance By Bachmair, K.; Schmitz, N.
  13. Measuring and Mitigating Racial Disparities in Large Language Model Mortgage Underwriting By Don S. Bowen; McKay Price; Luke Stein; Ke Yang
  14. Generative Agents and Expectations: Do LLMs Align with Heterogeneous Agent Models? By Filippo Gusella; Eugenio Vicario

  1. By: Sotiris Tsolacos; Tatiana Franus
    Abstract: In this paper, we evaluate the performance of various methodologies for forecasting real estate yields. Expected yield changes are a crucial input for valuations and investment strategies. We conduct a comparative study to assess the forecast accuracy of econometric and time series models relative to machine learning algorithms. Our target series include net initial and equivalent yields across key real estate sectors: office, industrial, and retail. The analysis is based on monthly UK data, though the framework can be applied to different contexts, including quarterly data. The econometric and time series models considered include ARMA, ARMAX, stepwise regression, and VAR family models, while the machine learning methods encompass Random Forest, XGBoost, Decision Tree, Gradient Boosting and Support Vector Machines. We utilise a comprehensive set of economic, financial, and survey data to predict yield movements and evaluate forecast performance over three-, six-, and twelve-month horizons. While conventional forecast metrics are calculated, our primary focus is on directional forecasting. The findings have significant practical implications. By capturing directional changes, our assessment aids price discovery in real estate markets. Given that private-market real estate data are reported with a lag - even for monthly data - early signals of price movements are valuable for investors and lenders. This study aims to identify the most successful methods to gauge forthcoming yield movements.
    Keywords: directional forecasting; econometric models; Machine Learning; property yields
    JEL: R3
    Date: 2025–01–01
    URL: https://d.repec.org/n?u=RePEc:arz:wpaper:eres2025_269
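    To make the directional evaluation concrete, here is a minimal sketch of a sign-based hit-rate metric of the kind the abstract describes; the data are simulated stand-ins, not the UK yield series used in the paper.
      import numpy as np

      def directional_accuracy(realised_change, predicted_change):
          """Share of periods where the predicted sign of the yield change matches the realised sign."""
          realised_change = np.asarray(realised_change)
          predicted_change = np.asarray(predicted_change)
          return float(np.mean(np.sign(realised_change) == np.sign(predicted_change)))

      # Toy example: 3-month-ahead changes in a net initial yield series (simulated)
      rng = np.random.default_rng(0)
      realised = rng.normal(0.0, 0.05, 100)
      predicted = realised + rng.normal(0.0, 0.05, 100)   # placeholder model output
      print(f"Hit rate: {directional_accuracy(realised, predicted):.1%}")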
  2. By: Stanislav Selitskiy
    Abstract: We investigate a number of Artificial Neural Network architectures (well-known and more “exotic”) applied to long-term forecasts of financial time series for indexes on different global markets. The particular area of interest of this research is the correlation of these indexes' behaviour in terms of Machine Learning cross-training: would training an algorithm on an index from one global market produce similar or even better accuracy when such a model is applied to predict another index from a different market? The predominantly positive answer we demonstrate is another argument in favour of Eugene Fama's long-debated Efficient Market Hypothesis.
    Date: 2025–11
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2511.08658
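    A small sketch of the cross-training idea under the assumption of a simple lag-window regressor; the simulated indexes and the MLP stand in for the paper's data and architectures.
      import numpy as np
      from sklearn.neural_network import MLPRegressor

      def make_windows(series, window=20):
          """Turn a price series into (lagged-window, next-value) supervised pairs."""
          X = np.array([series[i:i + window] for i in range(len(series) - window)])
          return X, series[window:]

      rng = np.random.default_rng(7)
      index_a = 100 * np.exp(np.cumsum(rng.normal(0.0003, 0.010, 2000)))  # stand-in for one market index
      index_b = 100 * np.exp(np.cumsum(rng.normal(0.0002, 0.012, 2000)))  # stand-in for another market

      X_a, y_a = make_windows(index_a)
      X_b, y_b = make_windows(index_b)

      model = MLPRegressor(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
      model.fit(X_a, y_a)                                  # train on market A only
      print("cross-market R^2:", model.score(X_b, y_b))    # evaluate on market B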
  3. By: Elliot Beck; Franziska Eckert; Linus Kühne; Helge Liebert; Rina Rosenblatt-Wisch
    Abstract: We introduce a novel indicator that combines machine learning and large language models with traditional statistical methods to track sentiment regarding the economic outlook in Swiss news. The indicator is interpretable and timely, and it significantly improves the accuracy of GDP growth forecasts. Our approach is resource-efficient, modular, and offers a way of benefitting from state-of-the-art large language models even if data are proprietary and cannot be stored or analyzed on external infrastructure - a restriction faced by many central banks and public institutions.
    Date: 2025–11
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2511.04299
  4. By: Matthias Soot; Sabine Horvath; Danielle Warstat; Hans-Berndt Neuner; Alexandra Weitkamp
    Abstract: Analyzing the real estate market using modern machine learning (ML) methods is increasingly becoming a common approach. The variables (factors) influencing the real estate market (purchase price or value) often behave non-linearly. For this reason, ML methods seem to outperform the previously established linear regression models, especially when modelling larger datasets from large spatial submarkets or long timespans. However, many approaches in the literature reuse, for the new non-parametric methods, the same influencing parameters known from multiple linear regression models. It remains unclear whether there are further influencing variables that only prove significant in a non-linear model. The selection of influencing factors is understood here as model selection: in this work, we investigate model selection approaches on inhomogeneous German real estate transaction data from Brandenburg, Saxony and Lower Saxony. The aim of the research is improved automation of model selection starting from raw data. As a functional submarket, we aggregate multi-family houses and apartments to increase the sample size. The dataset has several gaps in explanatory parameters, e.g. living space. Furthermore, the influencing variables differ between apartments and multi-family houses. We therefore develop a method to model this inhomogeneity in a single approach (e.g. factor analysis). We consider Artificial Neural Networks (ANN), Random Forest (RF) and Gradient Boosting (GB) as ML models for which the model selection is performed. We compare the selected parameters with those from classical model selection for a linear approach.
    Keywords: Germany; Machine-Learning; Model-selection
    JEL: R3
    Date: 2025–01–01
    URL: https://d.repec.org/n?u=RePEc:arz:wpaper:eres2025_245
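    One way to read the non-linear model-selection step is as importance-based variable screening; the sketch below uses permutation importance under a Random Forest on synthetic data (the variable names, threshold and data are assumptions, not the German transaction records).
      import numpy as np
      from sklearn.ensemble import RandomForestRegressor
      from sklearn.inspection import permutation_importance

      rng = np.random.default_rng(0)
      n = 800
      X = rng.normal(size=(n, 5))              # e.g. living space, age, plot size, ... (assumed)
      price = 3 * X[:, 0] + 2 * np.sin(2 * X[:, 1]) + rng.normal(0, 0.5, n)  # X[:, 1] acts non-linearly

      rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X, price)
      imp = permutation_importance(rf, X, price, n_repeats=10, random_state=0)
      selected = [i for i, m in enumerate(imp.importances_mean) if m > 0.01]
      print("variables kept by the non-linear selection:", selected)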
  5. By: Phoebe Koundouri; Conrad Landis; Stathis Devves; Theofanis Zacharatos; Georgios Feretzakis
    Abstract: Systematic assessment of university contributions to the United Nations Sustainable Development Goals (SDGs) remains challenging due to the lack of standardized, scalable evaluation frameworks. This paper introduces a comprehensive four-pillar assessment framework combining advanced natural language processing and machine learning techniques with qualitative analysis to evaluate university engagement across Research, Education, Organizational Governance, and External Leadership dimensions. We demonstrate the framework's application through an empirical case study of Athens University of Economics and Business (AUEB), analyzing 870 working papers, educational curricula, organizational policies, and partnership activities. The automated content analysis reveals strong alignment with institutional and partnership-oriented goals (SDG 16: 99% coverage, SDG 17: 95.8%), economic development goals (SDG 8: 80.7%, SDG 9: 80.1%), and gender equality (SDG 5: 81.4%), while identifying significant gaps in environmental SDGs. The framework's multi-method approach, combining zero-shot classification, semantic similarity, named entity recognition, pattern matching, and topic modeling, provides reliable and transparent assessment suitable for replication across diverse institutional contexts. This replicable methodology enables universities worldwide to systematically evaluate their SDG contributions, identify strategic priorities, and enhance accountability to sustainable development commitments.
    Keywords: Sustainable Development Goals, Natural Language Processing, Machine Learning, Higher Education Assessment, University Performance Measurement, Text Mining, Semantic Analysis, Research Evaluation, Computational Social Science
    Date: 2025–11–17
    URL: https://d.repec.org/n?u=RePEc:aue:wpaper:2562
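    As an illustration of the zero-shot classification component, here is a minimal sketch assuming an off-the-shelf NLI model from Hugging Face; the model choice, label phrasing and threshold are illustrative, not the authors' exact configuration.
      from transformers import pipeline

      classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

      sdg_labels = [
          "SDG 5: gender equality",
          "SDG 8: decent work and economic growth",
          "SDG 16: peace, justice and strong institutions",
          "SDG 17: partnerships for the goals",
      ]

      abstract = "We study how anti-corruption institutions affect public procurement outcomes."
      result = classifier(abstract, candidate_labels=sdg_labels, multi_label=True)

      for label, score in zip(result["labels"], result["scores"]):
          if score > 0.5:                       # illustrative relevance threshold
              print(label, round(score, 3))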
  6. By: David Autor; Andrew Caplin; Daniel Martin; Philip Marx
    Abstract: The cost of error in many high-stakes settings is asymmetric: misdiagnosing pneumonia when absent is an inconvenience, but failing to detect it when present can be life-threatening. Because of this, artificial intelligence (AI) models used to assist such decisions are frequently trained with asymmetric loss functions that incorporate human decision-makers' trade-offs between false positives and false negatives. In two focal applications, we show that this standard alignment practice can backfire. In both cases, it would be better to train the machine learning model with a loss function that ignores the human's objective and then adjust predictions ex post according to that objective. We rationalize this result using an economic model of incentive design with endogenous information acquisition. The key insight from our theoretical framework is that machine classifiers perform not one but two incentivized tasks: choosing how to classify and learning how to classify. We show that while the adjustments engineers use correctly incentivize choosing, they can simultaneously reduce the incentives to learn. Our formal treatment of the problem reveals that methods embraced for their intuitive appeal can in fact misalign human and machine objectives in predictable ways.
    Date: 2025–11
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2511.07699
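    The alternative the abstract recommends (fitting under a symmetric loss, then adjusting ex post for the decision-maker's asymmetric costs) can be sketched as a simple threshold shift; the costs and data below are illustrative assumptions.
      import numpy as np
      from sklearn.linear_model import LogisticRegression

      rng = np.random.default_rng(0)
      X = rng.normal(size=(2000, 4))
      y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 1, 2000) > 0.5).astype(int)

      cost_fn, cost_fp = 10.0, 1.0              # a missed positive is assumed 10x worse than a false alarm
      clf = LogisticRegression().fit(X, y)      # trained on the plain, symmetric log-loss

      # Ex-post adjustment: flag a positive whenever the expected cost of missing it
      # exceeds the expected cost of a false alarm, i.e. when p >= c_fp / (c_fp + c_fn)
      threshold = cost_fp / (cost_fp + cost_fn)
      decisions = (clf.predict_proba(X)[:, 1] >= threshold).astype(int)
      print("positive rate at the cost-adjusted threshold:", decisions.mean())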
  7. By: Jan Schmid; He Cheng; Francisco Amaral; Jonas Zdrzalek
    Abstract: This study proposes the design of a real estate market forecasting model for the research area of Frankfurt am Main, utilising AI algorithms to predict house price indices. The model integrates two primary datasets: listing data from ImmoScout24 and transaction data from the local expert committee. These datasets were linked using a threshold optimisation approach to ensure accurate matching of listings and transactions at the object level. A comprehensive review of prior studies was conducted to select key predictors, supplemented by novel variables that measure differences between listing and transaction data, such as price differences and time on market. The Random-Forest-based algorithm selection process involved a meta-learning approach drawing on 54 prior studies, adapted to the final dataset's structure. The XGBoost model was selected as the most suitable algorithm, achieving a Mean Absolute Percentage Error (MAPE) of 1.76% and a Root Mean Square Error (RMSE) of 2.43 on the testing dataset. The methodology also incorporated macroeconomic and socio-economic indicators, with data structured into spatial-temporal grids for quarterly forecasting. The model demonstrated high predictive accuracy, offering valuable insights for real estate market analysis and future decision-making. Subsequent research is planned to validate the model and apply it to additional urban regions, initially focusing on the seven largest cities in Germany.
    Keywords: Forecasting; House Price Indices; Real Estate Market; XGBoost
    JEL: R3
    Date: 2025–01–01
    URL: https://d.repec.org/n?u=RePEc:arz:wpaper:eres2025_84
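    A stylised sketch of the threshold-based listing-to-transaction matching described above; the similarity features, weights and threshold are assumptions for illustration, not the authors' procedure.
      import pandas as pd

      def match_score(listing, transaction):
          """Crude object-level similarity score (illustrative weights)."""
          score = 0.5 * (listing["postcode"] == transaction["postcode"])
          score += 0.3 * (abs(listing["living_area"] - transaction["living_area"]) <= 5)
          score += 0.2 * (abs((listing["list_date"] - transaction["sale_date"]).days) <= 365)
          return score

      def match(listings, transactions, threshold):
          """Keep a listing-transaction pair only if its best similarity score clears the threshold."""
          pairs = []
          for _, listing in listings.iterrows():
              best_score, best_idx = max((match_score(listing, t), idx) for idx, t in transactions.iterrows())
              if best_score >= threshold:
                  pairs.append((listing.name, best_idx, best_score))
          return pairs

      listings = pd.DataFrame({"postcode": ["60311"], "living_area": [82.0],
                               "list_date": [pd.Timestamp("2023-03-01")]})
      transactions = pd.DataFrame({"postcode": ["60311"], "living_area": [80.0],
                                   "sale_date": [pd.Timestamp("2023-06-15")]})
      print(match(listings, transactions, threshold=0.8))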
  8. By: Krishna Neupane; Igor Griva
    Abstract: Corporate insiders have access to material non-public information (MNPI). Occasionally, insiders strategically bypass legal and regulatory safeguards to exploit MNPI when executing securities trades. Given the large volume of transactions, detecting unlawful insider trading and identifying the underlying patterns in insiders' behavior is an arduous task for humans. On the other hand, innovative machine learning architectures have shown promising results for analyzing large-scale and complex data with hidden patterns. One such popular technique is eXtreme Gradient Boosting (XGBoost), a state-of-the-art supervised classifier. We therefore apply XGBoost to the identification and detection of unlawful activities. The results demonstrate that XGBoost can identify unlawful transactions with a high accuracy of 97 percent and can provide a ranking of the features that play the most important role in detecting fraudulent activity.
    Date: 2025–11
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2511.08306
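    A hedged sketch of the classification-and-ranking step, on synthetic stand-in features (the paper's actual feature set and data are not reproduced here):
      import numpy as np
      from xgboost import XGBClassifier
      from sklearn.model_selection import train_test_split
      from sklearn.metrics import accuracy_score

      rng = np.random.default_rng(42)
      X = rng.normal(size=(1000, 6))            # e.g. trade size, timing, filing lag, ... (assumed features)
      y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(size=1000) > 1).astype(int)   # synthetic labels

      X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
      model = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.1, eval_metric="logloss")
      model.fit(X_tr, y_tr)

      print("accuracy:", accuracy_score(y_te, model.predict(X_te)))
      print("feature importance ranking:", np.argsort(model.feature_importances_)[::-1])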
  9. By: Vikram Aggarwal (Google); Jay Kulkarni (xKDR Forum); Aakriti Narang (xKDR Forum); Aditi Mascarenhas (xKDR Forum); Siddarth Raman (xKDR Forum); Ajay Shah (xKDR Forum); Susan Thomas (xKDR Forum)
    Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in text comprehension, but their ability to process complex, hierarchical tabular data remains underexplored. We present a novel approach to extracting structured data from multi-page government fiscal documents using LLM-based techniques. Applied to large annual fiscal documents from the State of Karnataka in India, our method achieves high accuracy through a multi-stage pipeline that leverages domain knowledge, sequential context, and algorithmic validation. Traditional OCR methods perform poorly here, producing errors that are hard to detect. The inherent structure of fiscal tables, with totals at each level of the hierarchy, allows for robust internal validation of the extracted data. We use these hierarchical relationships to create multi-level validation checks. We demonstrate that LLMs can read tables and also process document-specific structural hierarchies, offering a scalable process for converting PDF-based fiscal disclosures into research-ready databases. Our implementation shows promise for broader applications across developing country contexts.
    JEL: H6 H7 Y10
    Date: 2025–11
    URL: https://d.repec.org/n?u=RePEc:anf:wpaper:43
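    The hierarchical-total check lends itself to a simple recursive validation: each parent's reported amount should equal the sum of its children. The field names, amounts and tolerance below are assumptions, not the pipeline's actual schema.
      def validate_totals(node, tolerance=1.0):
          """Recursively compare each parent's amount with the sum of its children's amounts."""
          errors = []
          children = node.get("children", [])
          if children:
              child_sum = sum(child["amount"] for child in children)
              if abs(child_sum - node["amount"]) > tolerance:
                  errors.append((node["name"], node["amount"], child_sum))
              for child in children:
                  errors.extend(validate_totals(child, tolerance))
          return errors

      budget_head = {
          "name": "2202 General Education", "amount": 1500.0,
          "children": [
              {"name": "Primary Education", "amount": 900.0, "children": []},
              {"name": "Secondary Education", "amount": 600.0, "children": []},
          ],
      }
      print(validate_totals(budget_head))       # an empty list means the extracted hierarchy is consistent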
  10. By: Lukas Lautenschlaeger; Sophia Bodensteiner; Julia Freybote; Wolfgang Schäfers
    Abstract: Twitter is established as a major platform for sharing information and opinions online. Previous research has demonstrated a connection between Twitter-expressed market sentiment and financial markets, including the U.S. REIT market. This study builds on the existing literature by investigating the economic and real estate-related factors that shape Twitter sentiment and examining how its rational and irrational components differentially affect market dynamics. Given the nature of Twitter messages, comprehensive natural language processing is applied to clean and identify relevant posts and to provide the foundation for extracting the sentiment. The complex linguistic features of the informal language involved are handled using a large language model. Preliminary results suggest that the rational component of social media sentiment holds greater predictive value for market trends in periods where a higher share of professional investors is active on social media, such as the recent COVID-19 pandemic. By contrast, the findings indicate that the irrational component, which cannot be explained by the market itself, holds more explanatory power when mostly private investors are active on Twitter.
    Keywords: REIT; Social media sentiment; Textual Analysis
    JEL: R3
    Date: 2025–01–01
    URL: https://d.repec.org/n?u=RePEc:arz:wpaper:eres2025_150
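    One common way to separate a rational from an irrational sentiment component, in the spirit of the abstract, is to regress sentiment on market fundamentals and treat the residual as the irrational part; the regressors and data below are illustrative assumptions.
      import numpy as np
      import statsmodels.api as sm

      rng = np.random.default_rng(3)
      n = 200
      fundamentals = rng.normal(size=(n, 3))    # e.g. REIT returns, interest rates, macro surprises (assumed)
      sentiment = fundamentals @ np.array([0.4, -0.2, 0.1]) + rng.normal(0, 0.5, n)

      fit = sm.OLS(sentiment, sm.add_constant(fundamentals)).fit()
      rational = fit.fittedvalues               # component explained by fundamentals
      irrational = fit.resid                    # residual, unexplained component
      print("share of sentiment variance explained:", round(fit.rsquared, 3))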
  11. By: Muhammad Sukri Bin Ramli
    Abstract: We propose an interpretable machine learning framework to help identify trade data discrepancies that are challenging to detect with traditional methods. Our system analyzes trade data to find a novel inverse price-volume signature, a pattern where reported volumes increase as average unit prices decrease. The model achieves 0.9375 accuracy and was validated by comparing large-scale UN data with detailed firm-level data, confirming that the risk signatures are consistent. This scalable tool provides customs authorities with a transparent, data-driven method to shift from conventional to priority-based inspection protocols, translating complex data into actionable intelligence to support international environmental policies.
    Date: 2025–11
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2511.08638
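    The inverse price-volume signature can be illustrated as a negative correlation between reported volume and average unit price within a trade flow; the column names and threshold below are assumptions, not the paper's exact feature engineering.
      import pandas as pd

      def flag_inverse_signature(df, threshold=-0.5):
          """Flag exporter-importer flows whose volume and unit price move strongly in opposite directions."""
          df = df.assign(unit_price=df["trade_value_usd"] / df["volume_kg"])
          corr = (df.groupby(["exporter", "importer"])
                    .apply(lambda g: g["volume_kg"].corr(g["unit_price"]))
                    .rename("price_volume_corr"))
          return corr[corr < threshold].reset_index()

      toy = pd.DataFrame({
          "exporter": ["A"] * 4, "importer": ["B"] * 4,
          "volume_kg": [100, 150, 220, 300],
          "trade_value_usd": [50, 60, 66, 75],
      })
      print(flag_inverse_signature(toy))        # this flow is flagged: volumes rise while unit prices fall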
  12. By: Bachmair, K.; Schmitz, N.
    Abstract: While financial markets are known to contain information about future economic developments, the channels through which asset prices enhance macroeconomic forecastability remain insufficiently understood. We develop a structured set of like-for-like experiments to isolate which data and model properties drive forecasting power. Using U.S. data on inflation, industrial production, unemployment and equity returns, we test eight hypotheses along two dimensions: the contribution of financial data given different estimation methods and model classes, and the role of model choice given different financial inputs. Data aspects include cross-sectional granularity, intra-period frequency, and real-time, revisionless availability; model aspects include sparsity, direct versus indirect specification, nonlinearity, and state dependence on volatile periods. We find that financial data can deliver consistent and economically meaningful gains, but only under suitable modeling choices: Random Forest most reliably extracts useful signals, whereas an unregularised VAR often fails to do so; by contrast, expanding the financial information set along granularity, frequency, or real-time dimensions yields little systematic benefit. Gains strengthen somewhat under elevated policy uncertainty, especially for inflation, but are otherwise fragile. The analysis clarifies how data and model choices interact and provides practical guidance for forecasters on when and how to use financial inputs.
    Keywords: Macroeconomic Forecasting, Stock Returns, Hypothesis Testing, Machine Learning, Regularisation, Vector Autoregressions, Ridge Regression, Lasso, Random Forests, Support Vector Regression, Elastic Net, Principal Component Analysis, Neural Networks
    JEL: C32 C45 C53 C58 E27 E37 E44 G17
    Date: 2025–11–13
    URL: https://d.repec.org/n?u=RePEc:cam:camdae:2574
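    A stylised version of the "direct" machine-learning forecast the paper evaluates: regress the h-step-ahead target on predictors available today with a Random Forest. The simulated series, horizon and predictor set are illustrative assumptions.
      import numpy as np
      import pandas as pd
      from sklearn.ensemble import RandomForestRegressor

      h = 12                                    # forecast horizon in months (assumed)
      rng = np.random.default_rng(1)
      n = 300
      data = pd.DataFrame({
          "inflation": 0.01 * np.cumsum(rng.normal(0.2, 0.3, n)),
          "equity_return": rng.normal(0.5, 4.0, n),
      })

      # Direct specification: y_{t+h} regressed on information dated t
      X = data.iloc[:-h]
      y = data["inflation"].shift(-h).iloc[:-h]

      split = int(0.8 * len(X))
      rf = RandomForestRegressor(n_estimators=500, random_state=0)
      rf.fit(X.iloc[:split], y.iloc[:split])
      pred = rf.predict(X.iloc[split:])
      print("out-of-sample RMSE:", np.sqrt(np.mean((pred - y.iloc[split:]) ** 2)))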
  13. By: Don S. Bowen; McKay Price; Luke Stein; Ke Yang
    Abstract: We conduct the first study exploring the application of large language models (LLMs) to mortgage underwriting, using an audit study design that combines real loan application data with experimentally manipulated race and credit scores. First, we find that LLMs systematically recommend more denials and higher interest rates for Black applicants than otherwise-identical white applicants. These racial disparities are largest for lower-credit-score applicants and riskier loans, and exist across multiple generations of LLMs developed by three leading firms. Second, we identify a straightforward and effective mitigation strategy: simply instructing the LLM to make unbiased decisions. Doing so eliminates the racial approval gap and significantly reduces interest rate disparities. Finally, we show LLM recommendations correlate strongly with real-world lender decisions, even without fine-tuning, specialized training, macroeconomic context, or extensive application data. Our findings have important implications for financial firms exploring LLM applications and regulators overseeing AI’s rapidly expanding role in finance.
    Keywords: Artificial Intelligence; Fair Lending; Mortgage Underwriting; Racial Bias
    JEL: R3
    Date: 2025–01–01
    URL: https://d.repec.org/n?u=RePEc:arz:wpaper:eres2025_75
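    Schematically, the audit design pairs otherwise-identical applications that differ only in the manipulated race and credit score, with and without the unbiased-decision instruction. Everything below (the application text, the mitigation wording and the query_llm placeholder) is a hypothetical illustration, not the authors' prompts.
      from itertools import product

      BASE_APPLICATION = "Loan amount: $240,000; income: $85,000; DTI: 36%; property: single-family home."
      MITIGATION = "Make this decision without any bias with respect to the applicant's race."

      def build_prompt(race, credit_score, mitigate):
          parts = [
              "You are a mortgage underwriter. Recommend 'approve' or 'deny' and an interest rate.",
              BASE_APPLICATION,
              f"Applicant race: {race}. Credit score: {credit_score}.",
          ]
          if mitigate:
              parts.insert(1, MITIGATION)
          return "\n".join(parts)

      def query_llm(prompt):                    # placeholder: plug in an actual chat-completion client
          raise NotImplementedError

      for race, score, mitigate in product(["Black", "white"], [620, 720], [False, True]):
          prompt = build_prompt(race, score, mitigate)
          # response = query_llm(prompt)        # compare approvals and rates across the matched pairs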
  14. By: Filippo Gusella; Eugenio Vicario
    Abstract: Results in the Heterogeneous Agent Model (HAM) literature estimate the proportion of fundamentalists and trend followers in the financial market, a proportion that varies with the periods analyzed. In this paper, we use a large language model (LLM) to construct a generative agent (GA) that determines the probability of adopting one of the two strategies based on current information. The probabilities of strategy adoption are compared with those in the HAM literature for the S&P 500 index between 1990 and 2020. Our findings suggest that the resulting artificial intelligence (AI) expectations align with those reported in the HAM literature. At the same time, extending the analysis to artificial market data helps us to filter the decision-making process of the AI agent. In the artificial market, results confirm the heterogeneity in expectations but reveal a systematic asymmetry toward fundamentalist behavior.
    Date: 2025–11
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2511.08604

This nep-big issue is ©2025 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at https://nep.repec.org. For comments please write to the director of NEP, Marco Novarese at <director@nep.repec.org>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.