nep-big New Economics Papers
on Big Data
Issue of 2026–03–30
ten papers chosen by
Tom Coupé, University of Canterbury


  1. Generalized Stock Price Prediction for Multiple Stocks Combined with News Fusion By Pei-Jun Liao; Hung-Shin Lee; Yao-Fei Cheng; Li-Wei Chen; Hung-yi Lee; Hsin-Min Wang
  2. Causality Elicitation from Large Language Models By Takashi Kameyama; Masahiro Kato; Yasuko Hio; Yasushi Takano; Naoto Minakawa
  3. At-Risk Transformation for U.S. Recession Prediction By Rahul Billakanti; Minchul Shin
  4. Enhancing the Accuracy of Regional Input-Output Table Estimation: A Deep Learning Approach By Shogo Fukui
  5. Machine Learning techniques for synthetic data generation in Energy and Financial Markets By Oleksandr Castello; Marco Corazza
  6. Targeting of food aid programs: Evidence from Egypt By Mahmoud, Mai; Kurdi, Sikandra
  7. Targeting of food aid programs: Evidence from Egypt By Mahmoud, Mai; Kurdi, Sikandra
  8. From Natural Language to Executable Option Strategies via Large Language Models By Haochen Luo; Zhengzhao Lai; Junjie Xu; Yifan Li; Tang Pok Hin; Yuan Zhang; Chen Liu
  9. Investor risk profiles of large language models By Hanyong Cho; Geumil Bae; Jang Ho Kim
  10. Machines acquire scientific taste from institutional traces By Ziqin Gong; Ning Li; Huaikang Zhou

  1. By: Pei-Jun Liao; Hung-Shin Lee; Yao-Fei Cheng; Li-Wei Chen; Hung-yi Lee; Hsin-Min Wang
    Abstract: Predicting stock prices presents challenges in financial forecasting. While traditional approaches such as ARIMA and RNNs are prevalent, recent developments in Large Language Models (LLMs) offer alternative methodologies. This paper introduces an approach that integrates LLMs with daily financial news for stock price prediction. To address the challenge of processing news data and identifying relevant content, we utilize stock name embeddings within attention mechanisms. Specifically, we encode news articles using a pre-trained LLM and implement three attention-based pooling techniques -- self-attentive, cross-attentive, and position-aware self-attentive pooling -- to filter news based on stock relevance. The filtered news embeddings, combined with historical stock prices, serve as inputs to the prediction model. Unlike prior studies that focus on individual stocks, our method trains a single generalized model applicable across multiple stocks. Experimental results demonstrate a 7.11% reduction in Mean Absolute Error (MAE) compared to the baseline, indicating the utility of stock name embeddings for news filtering and price forecasting within a generalized framework.
    Date: 2026–03
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2603.19286
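The cross-attentive pooling the authors describe can be illustrated with a minimal sketch: the stock-name embedding acts as the attention query over pre-computed news embeddings, and the pooled vector is their attention-weighted sum. The embedding dimension and the scaled-dot-product scoring below are illustrative assumptions; in the paper these components are learned end to end.

```python
import numpy as np

def cross_attentive_pool(stock_emb, news_embs):
    # Attention scores: scaled dot product between the stock-name
    # embedding (query) and each news-article embedding (keys/values).
    d = stock_emb.shape[-1]
    scores = news_embs @ stock_emb / np.sqrt(d)        # shape (n_news,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                           # softmax over news items
    return weights @ news_embs                         # pooled vector, shape (d,)

rng = np.random.default_rng(1)
stock = rng.normal(size=16)        # stock-name embedding (illustrative dim)
news = rng.normal(size=(5, 16))    # five news-article embeddings
pooled = cross_attentive_pool(stock, news)
```

News items most aligned with the stock-name query receive the largest weights, which is the filtering effect the abstract attributes to the attention mechanism.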
  2. By: Takashi Kameyama; Masahiro Kato; Yasuko Hio; Yasushi Takano; Naoto Minakawa
    Abstract: Large language models (LLMs) are trained on enormous amounts of data and encode knowledge in their parameters. We propose a pipeline to elicit causal relationships from LLMs. Specifically, (i) we sample many documents from LLMs on a given topic, (ii) we extract an event list from each document, (iii) we group events that appear across documents into canonical events, (iv) we construct a binary indicator vector for each document over canonical events, and (v) we estimate candidate causal graphs using causal discovery methods. Our approach does not guarantee real-world causality. Rather, it provides a framework for presenting the set of causal hypotheses that LLMs can plausibly assume, as an inspectable set of variables and candidate graphs.
    Date: 2026–03
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2603.04276
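Steps (iii) and (iv) of the pipeline, canonicalizing events and building per-document binary indicator vectors, can be sketched in a few lines. The event lists below are hypothetical stand-ins for LLM output; the document sampling of steps (i) and (ii) and the causal discovery of step (v) are omitted.

```python
from itertools import chain

# Hypothetical event lists standing in for events extracted from
# LLM-sampled documents on a topic (steps i and ii are not reproduced).
docs = [
    ["rate hike", "credit tightening", "slowdown"],
    ["rate hike", "slowdown"],
    ["credit tightening", "defaults"],
]

# Step (iii): merge events across documents into an ordered canonical vocabulary.
canonical = sorted(set(chain.from_iterable(docs)))

# Step (iv): one binary indicator vector per document over canonical events;
# these vectors would feed a causal discovery method in step (v).
indicators = [[int(event in doc) for event in canonical] for doc in docs]
```

Each row of `indicators` records which canonical events an individual document mentions, which is exactly the data representation that standard causal discovery methods take as input.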
  3. By: Rahul Billakanti; Minchul Shin
    Abstract: We propose a simple binarization of predictors, an "at-risk" transformation, as an alternative to the standard practice of using continuous, standardized variables in recession forecasting models. By converting predictors into indicators of unusually weak states based on a thresholding rule estimated from training data, we demonstrate their ability to capture the discrete nature of rare events such as U.S. recessions. Using a large panel of monthly U.S. macroeconomic and financial data, we show that binarized predictors consistently improve out-of-sample forecasting performance, often making linear models competitive with flexible machine learning methods, and that the gains are particularly pronounced around the onset of recessions.
    Date: 2026–03
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2603.07813
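The at-risk transformation can be sketched as a quantile-based binarization with the threshold estimated on training data only. The 10% quantile below is an illustrative choice, not the paper's estimated thresholding rule.

```python
import numpy as np

def at_risk_transform(x_train, x_test, q=0.1):
    # Threshold for an "unusually weak state", estimated on the training
    # sample only so no test-set information leaks into the transformation.
    threshold = np.quantile(x_train, q)
    return (x_train <= threshold).astype(int), (x_test <= threshold).astype(int)

rng = np.random.default_rng(0)
train = rng.normal(size=200)   # stand-in for a standardized macro predictor
test = rng.normal(size=50)
z_train, z_test = at_risk_transform(train, test)
```

The resulting 0/1 indicators can replace the continuous predictor in an otherwise unchanged linear forecasting model, which is the substitution the abstract evaluates.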
  4. By: Shogo Fukui
    Abstract: Non-survey methods have been developed and applied for estimating regional input-output tables. However, there is an ongoing debate about the assumptions these methods require and the accuracy they achieve. To address these issues, this study presents a deep learning method for estimating regional input-output tables. First, regional quantitative economic data are augmented by linear combinations. Then, deep learning is performed on each item in the input-output table, treating these items as target variables. Finally, regional input-output tables are estimated by applying matrix balancing to the predicted values from the trained model. The estimation accuracy of this method is verified using the 2015 input-output table for Japan as a benchmark. Compared to matrix balancing under the ideal assumption of known row and column sums, our method generally achieves higher estimation accuracy. Thus, this method is anticipated to provide a foundation for deriving more precise estimates of regional input-output tables.
    Date: 2026–03
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2603.13823
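The matrix-balancing step can be illustrated with the classical RAS (iterative proportional fitting) procedure, a standard way to adjust a seed matrix to given row and column sums; whether the paper uses RAS specifically is an assumption here, and the 2-by-2 seed is a toy stand-in for the deep-learning predictions.

```python
import numpy as np

def ras_balance(seed, row_sums, col_sums, iters=100):
    # Alternately rescale rows and columns until both target margins hold.
    X = seed.astype(float).copy()
    for _ in range(iters):
        X *= (row_sums / X.sum(axis=1))[:, None]   # match row sums
        X *= (col_sums / X.sum(axis=0))[None, :]   # match column sums
    return X

seed = np.array([[2.0, 1.0],
                 [1.0, 3.0]])
balanced = ras_balance(seed,
                       row_sums=np.array([4.0, 6.0]),
                       col_sums=np.array([5.0, 5.0]))
```

The balanced matrix keeps the seed's relative structure while satisfying the prescribed margins, which is why the quality of the seed (here, the model's predictions) drives the final estimation accuracy.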
  5. By: Oleksandr Castello (Ca’ Foscari University of Venice); Marco Corazza (Ca’ Foscari University of Venice)
    Abstract: The availability of sufficiently large, reliable, and high-quality datasets represents a fundamental prerequisite for quantitative analysis and data-driven decision-making in economics and finance. In practice, however, financial data are often limited, noisy, or subject to restricted access, creating significant empirical constraints for both researchers and practitioners. Recent advances in Generative Machine Learning (GenML) provide promising tools to overcome these limitations by enabling the generation of synthetic data capable of preserving the main statistical features of original data. Despite the rapid diffusion of these techniques, most existing studies focus on replicating stylized facts of financial time series or producing forward-looking simulations, while less attention has been devoted to a systematic assessment of the generative fidelity and generalization capacity of alternative models across different distributional environments. Motivated by this gap, this study provides a comparative evaluation of several Deep Generative Machine Learning (Deep-GenML) families by assessing their ability to reproduce both theoretical statistical distributions and empirical financial and commodity market data. The analysis spans multiple Deep-GenML architectures, distributional settings and market regimes, while also examining model performance under alternative training configurations that reflect varying degrees of data availability. The empirical evidence indicates that deep generative models are capable of accurately reproducing complex distributional features—including heavy tails, asymmetry, and multimodality—across a wide range of scenarios. Overall, the results highlight the potential of deep generative approaches as flexible tools for synthetic data generation and distributional modeling in financial and energy market applications.
    Keywords: Deep Generative Machine Learning, Synthetic data generation, GAN, VAE, EBM, Financial and Energy market data
    JEL: C45 C46 C58 C63
    Date: 2026
    URL: https://d.repec.org/n?u=RePEc:ven:wpaper:2026:11
  6. By: Mahmoud, Mai; Kurdi, Sikandra
    Abstract: In-kind food aid programs remain prominent worldwide. Targeting in these programs is complex due to potential distortions in consumption. This paper advances the literature by moving beyond poverty-based targeting to address nutritional objectives. Using data from a randomized controlled trial (RCT), we apply machine learning (ML) techniques to analyze heterogeneity in impacts across nutritional outcomes, aiming to inform targeting based on observable characteristics. We find that such characteristics significantly predict heterogeneity in treatment effects, though relevant predictors differ by outcome and treatment type. Building on recent literature advocating for balancing of deprivation and expected impact, we show that, in our context, the trade-off between targeting the most impacted versus the most deprived households is limited. Instead, the main challenge is prioritizing among competing nutritional objectives. Our findings indicate that ML methods can inform outcome-specific targeting criteria, though these criteria vary across outcomes and are imperfectly correlated.
    Keywords: nutrition; econometric models; food aid; machine learning; targeting; Egypt; Northern Africa
    Date: 2025–12–31
    URL: https://d.repec.org/n?u=RePEc:fpr:ifprid:179370
  7. By: Mahmoud, Mai; Kurdi, Sikandra
    Abstract: In-kind food aid programs remain prominent worldwide. Targeting in these programs is complex due to potential distortions in consumption. This paper advances the literature by moving beyond poverty-based targeting to address nutritional objectives. Using data from a randomized controlled trial (RCT), we apply machine learning (ML) techniques to analyze heterogeneity in impacts across nutritional outcomes, aiming to inform targeting based on observable characteristics. We find that such characteristics significantly predict heterogeneity in treatment effects, though relevant predictors differ by outcome and treatment type. Building on recent literature advocating for balancing of deprivation and expected impact, we show that, in our context, the trade-off between targeting the most impacted versus the most deprived households is limited. Instead, the main challenge is prioritizing among competing nutritional objectives. Our findings indicate that ML methods can inform outcome-specific targeting criteria, though these criteria vary across outcomes and are imperfectly correlated.
    Keywords: nutrition; econometric models; food aid; machine learning; targeting; Egypt; Africa; Northern Africa
    Date: 2025–12–31
    URL: https://d.repec.org/n?u=RePEc:fpr:gsspwp:179370
  8. By: Haochen Luo; Zhengzhao Lai; Junjie Xu; Yifan Li; Tang Pok Hin; Yuan Zhang; Chen Liu
    Abstract: Large Language Models (LLMs) excel at general code generation, yet translating natural-language trading intents into correct option strategies remains challenging. Real-world option design requires reasoning over massive, multi-dimensional option chain data with strict constraints, which often overwhelms direct generation methods. We introduce the Option Query Language (OQL), a domain-specific intermediate representation that abstracts option markets into high-level primitives under grammatical rules, enabling LLMs to function as reliable semantic parsers rather than free-form programmers. OQL queries are then validated and executed deterministically by an engine to instantiate executable strategies. We also present a new dataset for this task and demonstrate that our neuro-symbolic pipeline significantly improves execution accuracy and logical consistency over direct baselines.
    Date: 2026–03
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2603.16434
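The abstract does not specify OQL's grammar, but the validate-then-execute idea can be illustrated with a hypothetical miniature: a fixed grammar that accepts only well-formed queries and rejects free-form text, so an LLM acting as a semantic parser can only emit checkable output. Every token and field name below is invented for illustration.

```python
import re

# Hypothetical miniature of a constrained option-query grammar; the
# real OQL primitives are not given in the abstract.
PATTERN = re.compile(r"^(BUY|SELL) (CALL|PUT) DELTA ([0-9.]+) EXPIRY (\d+)D$")

def parse_query(query):
    # Validate against the grammar; anything else is rejected before
    # it could reach the deterministic execution engine.
    m = PATTERN.match(query)
    if m is None:
        raise ValueError(f"invalid query: {query!r}")
    side, kind, delta, days = m.groups()
    return {"side": side, "kind": kind,
            "delta": float(delta), "expiry_days": int(days)}

parsed = parse_query("SELL PUT DELTA 0.30 EXPIRY 45D")
```

Pushing validation into the intermediate representation is what lets the downstream engine instantiate strategies deterministically, rather than trusting free-form generated code.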
  9. By: Hanyong Cho; Geumil Bae; Jang Ho Kim
    Abstract: This paper investigates how large language models (LLMs) form and express investor risk profiles, a critical component of retail investment advising. We examine three LLMs (GPT, Gemini, and Llama) and assess their responses to a standardized risk questionnaire under varying prompts. In particular, we establish each model's default investment profile by analyzing repeated responses per model. We observe that LLMs are generally long-term investors but exhibit different tendencies in risk tolerance: Gemini has a moderate risk level with highly consistent responses, Llama skews more conservative, and GPT appears moderately aggressive with the greatest variation in answers. Moreover, we find that assigning specific personas such as age, wealth, and investment experience leads each LLM to adjust its risk profile, although the extent of these adjustments differs across the models.
    Date: 2026–03
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2603.09303
  10. By: Ziqin Gong; Ning Li; Huaikang Zhou
    Abstract: Artificial intelligence matches or exceeds human performance on tasks with verifiable answers, from protein folding to Olympiad mathematics. Yet the capacity that most governs scientific advance is not reasoning but taste: the ability to judge which untested ideas deserve pursuit, exercised daily by editors and funders but never successfully articulated, taught, or automated. Here we show that fine-tuning language models on journal publication decisions recovers evaluative judgment inaccessible to both frontier models and human expertise. Using a held-out benchmark of research pitches in management spanning four quality tiers, we find that eleven frontier models, spanning major proprietary and open architectures, barely exceed chance, averaging 31% accuracy. Panels of journal editors and editorial board members reach 42% by majority vote. Fine-tuned models trained on years of publication records each surpass every frontier model and expert panel, with the best single model achieving 59%. These models exhibit calibrated confidence, reaching 100% accuracy on their highest-confidence predictions, and transfer this evaluative signal to untrained pairwise comparisons and one-sentence summaries. The mechanism generalizes: models trained on economics publication records achieve 70% accuracy. Scientific taste was not missing from AI's reach; it was deposited in the institutional record, waiting to be extracted. These results provide a scalable mechanism to triage the expanding volume of scientific production across disciplines where quality resists formal verification.
    Date: 2026–03
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2603.16659

This nep-big issue is ©2026 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at https://nep.repec.org. For comments please write to the director of NEP, Marco Novarese at <director@nep.repec.org>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the Griffith Business School of Griffith University in Australia.