nep-big New Economics Papers
on Big Data
Issue of 2025–10–27
seventeen papers chosen by
Tom Coupé, University of Canterbury


  1. Enhancing OHLC Data with Timing Features: A Machine Learning Evaluation By Ruslan Tepelyan
  2. Deep learning CAT bond valuation By Julian Sester; Huansang Xu
  3. Predicting Regional Unemployment in the EU By Paglialunga, Elena; Resce, Giuliano; Zanoni, Angela
  4. Essays on Consistency and Randomization in Machine Learning and Fraud Detection By Revelas, Christos
  5. Application of Deep Reinforcement Learning to At-the-Money S&P 500 Options Hedging By Zofia Bracha; Jakub Michańków; Paweł Sakowski
  6. Public communication and collusion: New screening tools for competition authorities By Duso, Tomaso; Harrington, Joseph E.; Kreuzberg, Carl; Sapi, Geza
  7. Exploring the Synergy of Quantitative Factors and Newsflow Representations from Large Language Models for Stock Return Prediction By Tian Guo; Emmanuel Hauptmann
  8. Empirical Corporate Finance and Deep Learning By Liu, Zihao
  9. RegimeFolio: A Regime Aware ML System for Sectoral Portfolio Optimization in Dynamic Markets By Yiyao Zhang; Diksha Goel; Hussain Ahmad; Claudia Szabo
  10. Sensitivity analysis for treatment effects in difference-in-differences models using Riesz Representation By Bach, Philipp; Klaaßen, Sven; Kueck, Jannis; Mattes, Mara; Spindler, Martin
  11. DeepAries: Adaptive Rebalancing Interval Selection for Enhanced Portfolio Selection By Jinkyu Kim; Hyunjung Yi; Mogan Gim; Donghee Choi; Jaewoo Kang
  12. Management Practices and Firm Performance During the Great Recession By Florian Englmaier; Jose E. Galdon-Sanchez; Ricard Gil; Michael Kaiser; Helene Strandt
  13. Firm characteristics of two-way traders: Evidence from Probit vs. Kernel-Regularized Least Squares regressions By Joachim Wagner
  14. Extracting O*NET Features from the NLx Corpus to Build Public Use Aggregate Labor Market Data By Meisenbacher, Stephen; Nestorov, Svetlozar; Norlander, Peter
  15. Technology and Labor Markets: Past, Present, and Future; Evidence from Two Centuries of Innovation By Huben Liu; Dimitris Papanikolaou; Lawrence D.W. Schmidt; Bryan Seegmiller
  16. Using Natural Language Processing to Identify Monetary Policy Shocks By Alexandra Piller; Marc Schranz; Larissa Schwaller
  17. Large Language Models and Futures Price Factors in China By Yuhan Cheng; Heyang Zhou; Yanchu Liu

  1. By: Ruslan Tepelyan
    Abstract: OHLC bar data is a widely used format for representing financial asset prices over time due to its balance of simplicity and informativeness. Bloomberg has recently introduced a new bar data product that includes additional timing information: specifically, the timestamps of the open, high, low, and close prices within each bar. In this paper, we investigate the impact of incorporating this timing data into machine learning models for predicting volume-weighted average price (VWAP). Our experiments show that including these features consistently improves predictive performance across multiple ML architectures. We observe gains across several key metrics, including log-likelihood, mean squared error (MSE), $R^2$, conditional variance estimation, and directional accuracy.
    Date: 2025–09
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2509.16137
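    A minimal sketch of the kind of experiment the abstract describes: fit the same gradient-boosting VWAP regressor with and without intra-bar timing columns and compare test metrics. The data are synthetic, and the timing features (minute of the bar's high and low) are hypothetical stand-ins for Bloomberg's timestamp fields, not the paper's setup.
      # Illustrative only: augment OHLC bars with placeholder intra-bar timing
      # features and compare a VWAP regressor with and without them.
      import numpy as np
      import pandas as pd
      from sklearn.ensemble import HistGradientBoostingRegressor
      from sklearn.model_selection import train_test_split
      from sklearn.metrics import mean_squared_error, r2_score

      rng = np.random.default_rng(0)
      n = 5_000
      bars = pd.DataFrame({"open": 100 + rng.normal(0, 1, n).cumsum()})
      bars["close"] = bars["open"] + rng.normal(0, 0.5, n)
      bars["high"] = bars[["open", "close"]].max(axis=1) + rng.exponential(0.3, n)
      bars["low"] = bars[["open", "close"]].min(axis=1) - rng.exponential(0.3, n)
      # Hypothetical timing features: minute within the bar at which the
      # high/low printed (stand-ins for the new timestamp fields).
      bars["t_high"] = rng.integers(0, 60, n)
      bars["t_low"] = rng.integers(0, 60, n)
      # Synthetic VWAP target, loosely anchored between low and high.
      w = rng.uniform(0.3, 0.7, n)
      bars["vwap"] = w * bars["high"] + (1 - w) * bars["low"]

      base_cols = ["open", "high", "low", "close"]
      timed_cols = base_cols + ["t_high", "t_low"]
      X_tr, X_te, y_tr, y_te = train_test_split(bars, bars["vwap"], random_state=0)
      for name, cols in {"OHLC only": base_cols, "OHLC + timing": timed_cols}.items():
          model = HistGradientBoostingRegressor(random_state=0).fit(X_tr[cols], y_tr)
          pred = model.predict(X_te[cols])
          print(name, "MSE:", mean_squared_error(y_te, pred), "R2:", r2_score(y_te, pred))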
  2. By: Julian Sester; Huansang Xu
    Abstract: In this paper, we propose an alternative valuation approach for CAT bonds in which a pricing formula is learned by deep neural networks. Once trained, these networks can be used to price CAT bonds as a function of inputs that reflect both the current market conditions and the specific features of the contract. This approach offers two main advantages. First, due to the expressive power of neural networks, the trained model enables fast and accurate evaluation of CAT bond prices. Second, because of its fast execution, the trained neural network can easily be analyzed to study its sensitivities with respect to changes in the underlying market conditions, offering valuable insights for risk management.
    Date: 2025–09
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2509.25899
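    A schematic sketch, not the authors' model: a small feed-forward network is fitted to a toy placeholder pricing function of market and contract inputs, and automatic differentiation then gives cheap sensitivities of the learned price.
      # Sketch: approximate a (placeholder) CAT bond pricing function with an MLP.
      import torch
      import torch.nn as nn

      torch.manual_seed(0)

      # Hypothetical inputs: risk-free rate, expected loss, attachment point,
      # expiry in years, coupon spread; the "true" pricer below is a toy stand-in.
      def toy_price(x):
          rate, eloss, attach, expiry, spread = x.unbind(dim=1)
          return torch.exp(-rate * expiry) * (1 + spread - eloss / (attach + 1.0))

      X = torch.rand(10_000, 5)
      y = toy_price(X).unsqueeze(1)

      net = nn.Sequential(nn.Linear(5, 64), nn.ReLU(),
                          nn.Linear(64, 64), nn.ReLU(),
                          nn.Linear(64, 1))
      opt = torch.optim.Adam(net.parameters(), lr=1e-3)
      for step in range(2_000):
          opt.zero_grad()
          loss = nn.functional.mse_loss(net(X), y)
          loss.backward()
          opt.step()

      # Once trained, pricing and sensitivity analysis are cheap: gradients of
      # the learned price with respect to the inputs approximate sensitivities.
      x0 = torch.rand(1, 5, requires_grad=True)
      price = net(x0)
      price.backward()
      print("price:", price.item(), "sensitivities:", x0.grad.squeeze().tolist())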
  3. By: Paglialunga, Elena; Resce, Giuliano; Zanoni, Angela
    Abstract: This paper predicts regional unemployment in the European Union by applying machine learning techniques to a dataset covering 198 NUTS-2 regions, 2000 to 2019. Tree-based models substantially outperform traditional regression approaches for this task, while accommodating reinforcement effects and spatial spillovers as determinants of regional labor market outcomes. Inflation—particularly energy-related—emerges as a critical predictor, highlighting vulnerabilities to energy shocks and green transition policies. Environmental policy stringency and eco-innovation capacity also prove significant. Our findings demonstrate the potential of machine learning to support proactive, place-sensitive interventions, aiming to predict and mitigate the uneven socioeconomic impacts of structural change across regions.
    Keywords: Regional unemployment; Inflation; Environmental policy; Spatial spillovers; Machine learning.
    JEL: E24 J64 Q52 R23
    Date: 2025–10–15
    URL: https://d.repec.org/n?u=RePEc:mol:ecsdps:esdp25101
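    An illustrative comparison, on synthetic regional panel data, of a tree-based learner against OLS; the feature names (lagged own and neighboring unemployment, energy inflation, policy stringency, eco-innovation) echo the predictors named in the abstract, but the data-generating process is invented.
      # Compare a random forest with OLS on a synthetic, non-linear outcome.
      import numpy as np
      import pandas as pd
      from sklearn.ensemble import RandomForestRegressor
      from sklearn.linear_model import LinearRegression
      from sklearn.model_selection import cross_val_score

      rng = np.random.default_rng(1)
      n = 3_000
      df = pd.DataFrame({
          "unemp_lag": rng.uniform(2, 25, n),        # reinforcement effect
          "unemp_neighbors": rng.uniform(2, 25, n),  # spatial spillover
          "inflation_energy": rng.normal(3, 2, n),
          "policy_stringency": rng.uniform(0, 5, n),
          "eco_innovation": rng.uniform(0, 1, n),
      })
      # Non-linear synthetic outcome: the interaction is what trees can pick up.
      df["unemp"] = (0.7 * df["unemp_lag"] + 0.2 * df["unemp_neighbors"]
                     + 0.5 * df["inflation_energy"] * (df["unemp_lag"] > 15)
                     + rng.normal(0, 1, n))

      X, y = df.drop(columns="unemp"), df["unemp"]
      for name, model in [("OLS", LinearRegression()),
                          ("Random forest", RandomForestRegressor(n_estimators=200, random_state=1))]:
          score = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
          print(f"{name}: mean CV R^2 = {score:.3f}")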
  4. By: Revelas, Christos (Tilburg University, School of Economics and Management)
    Date: 2025
    URL: https://d.repec.org/n?u=RePEc:tiu:tiutis:2ee1c0cb-ed62-441e-9bf4-103e06ff245f
  5. By: Zofia Bracha (Faculty of Economic Sciences, University of Warsaw); Jakub Michańków (TripleSun, Krakow); Paweł Sakowski (Faculty of Economic Sciences, University of Warsaw)
    Abstract: This paper explores the application of deep Q-learning to hedging at-the-money options on the S&P 500 index. We develop an agent based on the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm, trained to simulate hedging decisions without making explicit model assumptions on price dynamics. The agent was trained on historical intraday prices of S&P 500 call options from 2004 to 2024, using a single time series of six predictor variables: option price, underlying asset price, moneyness, time to maturity, realized volatility, and current hedge position. A walk-forward procedure was applied for training, which led to nearly 17 years of out-of-sample evaluation. The performance of the deep reinforcement learning (DRL) agent is benchmarked against the Black–Scholes delta hedging strategy over the same time period. We assess both approaches using metrics such as annualized return, volatility, information ratio, and Sharpe ratio. To test the models’ adaptability, we performed simulations across varying market conditions and added constraints such as transaction costs and risk-awareness penalties. Our results show that the DRL agent can outperform traditional hedging methods, particularly in volatile or high-cost environments, highlighting its robustness and flexibility in practical trading contexts. While the agent consistently outperforms delta hedging, its performance deteriorates when the risk-awareness parameter is higher. We also observed that the longer the time interval used for volatility estimation, the more stable the results.
    Keywords: Deep learning, Reinforcement learning, Double Deep Q-networks, options market, options hedging, deep hedging
    JEL: C4 C14 C45 C53 C58 G13
    Date: 2025
    URL: https://d.repec.org/n?u=RePEc:war:wpaper:2025-25
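    A hedged sketch, not the authors' implementation: a toy hedging environment exposing the six state variables listed in the abstract, trained with an off-the-shelf TD3 agent from stable-baselines3; the price dynamics, option sensitivity, and cost parameters are placeholders.
      # Minimal option-hedging environment + TD3 agent (illustrative only).
      import numpy as np
      import gymnasium as gym
      from gymnasium import spaces
      from stable_baselines3 import TD3

      class HedgingEnv(gym.Env):
          def __init__(self, n_steps=250, cost=0.001):
              super().__init__()
              self.n_steps, self.cost = n_steps, cost
              # state: option price, underlying, moneyness, time to maturity,
              # realized volatility, current hedge position
              self.observation_space = spaces.Box(-np.inf, np.inf, shape=(6,), dtype=np.float32)
              self.action_space = spaces.Box(0.0, 1.0, shape=(1,), dtype=np.float32)  # hedge ratio

          def _obs(self):
              return np.array([self.opt, self.S, self.S / 100.0,
                               self.tau, self.vol, self.pos], dtype=np.float32)

          def reset(self, *, seed=None, options=None):
              super().reset(seed=seed)
              self.t, self.S, self.opt = 0, 100.0, 5.0
              self.tau, self.vol, self.pos = 1.0, 0.2, 0.0
              return self._obs(), {}

          def step(self, action):
              new_pos = float(action[0])
              dS = self.np_random.normal(0.0, self.vol * self.S / np.sqrt(252))
              d_opt = 0.5 * dS                      # toy option sensitivity
              pnl = new_pos * dS - d_opt            # hedged P&L of a short option
              reward = pnl - self.cost * abs(new_pos - self.pos) * self.S
              self.S += dS; self.opt += d_opt
              self.tau -= 1 / 252; self.pos = new_pos; self.t += 1
              return self._obs(), float(reward), self.t >= self.n_steps, False, {}

      model = TD3("MlpPolicy", HedgingEnv(), verbose=0)
      model.learn(total_timesteps=10_000)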
  6. By: Duso, Tomaso; Harrington, Joseph E.; Kreuzberg, Carl; Sapi, Geza
    Abstract: Competition authorities increasingly rely on economic screening tools to identify markets where firms deviate from competitive norms. Traditional screening methods assume that collusion occurs through secret agreements. However, recent research highlights that firms can use public announcements to coordinate decisions, reducing competition while avoiding detection. We propose a novel approach to screening for collusion in public corporate statements. Using natural language processing, we analyze more than 300, 000 earnings call transcripts issued worldwide between 2004 and 2022. By identifying expressions commonly associated with collusion, our method provides competition authorities with a tool to detect potentially anticompetitive behavior in public communications. Our approach can extend beyond earnings calls to other sources, such as news articles, trade press, and industry reports. Our method informed the European Commission's 2024 unannounced inspections in the car tire sector, prompted by concerns over price coordination through public communication.
    Keywords: Communication, Collusion, NLP, Screening, Text Analysis
    JEL: C23 D22 L1 L4 L64
    Date: 2025
    URL: https://d.repec.org/n?u=RePEc:zbw:dicedp:329636
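    A minimal sketch of dictionary-based transcript screening; the watch-phrase list is a hypothetical placeholder rather than the authors' screen, which the abstract does not reproduce.
      # Count hits of (placeholder) collusion-associated expressions per transcript.
      import re
      from collections import Counter

      WATCH_PHRASES = [
          r"price discipline",
          r"rational pricing",
          r"industry capacity discipline",
          r"we will follow (?:the )?price increases",
      ]
      pattern = re.compile("|".join(WATCH_PHRASES), flags=re.IGNORECASE)

      def screen(transcripts):
          """Count watch-phrase hits per firm-quarter transcript."""
          hits = Counter()
          for doc_id, text in transcripts.items():
              hits[doc_id] = len(pattern.findall(text))
          return hits

      example = {
          "FIRM_A_2020Q4": "We expect continued price discipline across the industry.",
          "FIRM_B_2020Q4": "Our focus remains on product innovation and cost control.",
      }
      print(screen(example))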
  7. By: Tian Guo; Emmanuel Hauptmann
    Abstract: In quantitative investing, return prediction supports various tasks, including stock selection, portfolio optimization, and risk management. Quantitative factors, such as valuation, quality, and growth, capture various characteristics of stocks. Unstructured financial data, like news and transcripts, has attracted growing attention, driven by recent advances in large language models (LLMs). This paper examines effective methods for leveraging multimodal factors and newsflow in return prediction and stock selection. First, we introduce a fusion learning framework to learn a unified representation from factors and newsflow representations generated by an LLM. Within this framework, we compare three representative methods: representation combination, representation summation, and attentive representations. Next, building on empirical observations from fusion learning, we explore the mixture model that adaptively combines predictions made by single modalities and their fusion. To mitigate the training instability observed in the mixture model, we introduce a decoupled training approach with theoretical insights. Finally, our experiments on real investment universes yield several insights into effective multimodal modeling of factors and news for stock return prediction.
    Date: 2025–10
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2510.15691
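    A hedged sketch of the three fusion variants named in the abstract (representation combination, summation, and attentive fusion), assuming one quantitative-factor vector and one LLM newsflow embedding per stock; all dimensions are placeholders.
      # Three ways to fuse factor and newsflow representations for return prediction.
      import torch
      import torch.nn as nn

      class Fusion(nn.Module):
          def __init__(self, d_factor=20, d_news=768, d_hidden=64, mode="attentive"):
              super().__init__()
              self.mode = mode
              self.f_proj = nn.Linear(d_factor, d_hidden)
              self.n_proj = nn.Linear(d_news, d_hidden)
              self.attn = nn.Linear(d_hidden, 1)
              in_dim = 2 * d_hidden if mode == "combination" else d_hidden
              self.head = nn.Linear(in_dim, 1)       # next-period return prediction

          def forward(self, factors, news):
              f, n = self.f_proj(factors), self.n_proj(news)
              if self.mode == "combination":         # concatenate representations
                  z = torch.cat([f, n], dim=-1)
              elif self.mode == "summation":         # add representations
                  z = f + n
              else:                                  # attention-weighted mixture
                  w = torch.softmax(torch.cat([self.attn(f), self.attn(n)], dim=-1), dim=-1)
                  z = w[..., :1] * f + w[..., 1:] * n
              return self.head(z)

      factors = torch.rand(32, 20)   # batch of stocks x quantitative factors
      news = torch.rand(32, 768)     # batch of stocks x LLM newsflow embedding
      for mode in ("combination", "summation", "attentive"):
          print(mode, Fusion(mode=mode)(factors, news).shape)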
  8. By: Liu, Zihao (Tilburg University, School of Economics and Management)
    Date: 2025
    URL: https://d.repec.org/n?u=RePEc:tiu:tiutis:440468a5-5c38-4aab-ba29-b83a255fac6a
  9. By: Yiyao Zhang; Diksha Goel; Hussain Ahmad; Claudia Szabo
    Abstract: Financial markets are inherently non-stationary, with shifting volatility regimes that alter asset co-movements and return distributions. Standard portfolio optimization methods, typically built on stationarity or regime-agnostic assumptions, struggle to adapt to such changes. To address these challenges, we propose RegimeFolio, a novel regime-aware and sector-specialized framework that, unlike existing regime-agnostic models such as DeepVol and DRL optimizers, integrates explicit volatility regime segmentation with sector-specific ensemble forecasting and adaptive mean-variance allocation. This modular architecture ensures forecasts and portfolio decisions remain aligned with current market conditions, enhancing robustness and interpretability in dynamic markets. RegimeFolio combines three components: (i) an interpretable VIX-based classifier for market regime detection; (ii) regime and sector-specific ensemble learners (Random Forest, Gradient Boosting) to capture conditional return structures; and (iii) a dynamic mean-variance optimizer with shrinkage-regularized covariance estimates for regime-aware allocation. We evaluate RegimeFolio on 34 large cap U.S. equities from 2020 to 2024. The framework achieves a cumulative return of 137 percent, a Sharpe ratio of 1.17, a 12 percent lower maximum drawdown, and a 15 to 20 percent improvement in forecast accuracy compared to conventional and advanced machine learning benchmarks. These results show that explicitly modeling volatility regimes in predictive learning and portfolio allocation enhances robustness and leads to more dependable decision-making in real markets.
    Date: 2025–09
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2510.14986
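    A schematic, in-sample-only sketch of the three-component pipeline the abstract describes: a VIX-threshold regime label, per-regime Random Forest return forecasts, and mean-variance weights with a Ledoit-Wolf shrinkage covariance. The thresholds and data are placeholders, and the sector specialization is omitted for brevity.
      import numpy as np
      import pandas as pd
      from sklearn.ensemble import RandomForestRegressor
      from sklearn.covariance import LedoitWolf

      rng = np.random.default_rng(2)
      T, N = 500, 8                                  # days x assets
      returns = pd.DataFrame(rng.normal(0, 0.01, (T, N)),
                             columns=[f"asset_{i}" for i in range(N)])
      vix = pd.Series(rng.uniform(10, 40, T))

      def regime(v):                                 # placeholder cutoffs
          return "calm" if v < 20 else ("stressed" if v > 30 else "normal")

      labels = vix.map(regime)
      features = returns.shift(1).dropna()           # lagged returns as features
      target = returns.iloc[1:]
      lab = labels.iloc[1:]

      # One forecaster per regime (and per asset), fitted on that regime's days.
      forecasts = pd.DataFrame(index=target.index, columns=returns.columns, dtype=float)
      for reg in lab.unique():
          idx = lab[lab == reg].index
          for col in returns.columns:
              model = RandomForestRegressor(n_estimators=100, random_state=0)
              model.fit(features.loc[idx], target.loc[idx, col])
              forecasts.loc[idx, col] = model.predict(features.loc[idx])

      # Mean-variance weights with Ledoit-Wolf shrinkage covariance.
      mu = forecasts.mean().to_numpy()
      sigma = LedoitWolf().fit(returns).covariance_
      raw = np.linalg.solve(sigma, mu)
      weights = raw / np.abs(raw).sum()
      print(dict(zip(returns.columns, weights.round(3))))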
  10. By: Bach, Philipp; Klaaßen, Sven; Kueck, Jannis; Mattes, Mara; Spindler, Martin
    Abstract: Difference-in-differences (DiD) is one of the most popular approaches for empirical research in economics, political science, and beyond. Identification in these models is based on the conditional parallel trends assumption: In the absence of treatment, the average outcomes of the treated and untreated groups are assumed to evolve in parallel over time, conditional on pre-treatment covariates. We introduce a novel approach to sensitivity analysis for DiD models that assesses the robustness of DiD estimates to violations of this assumption due to unobservable confounders, allowing researchers to transparently assess and communicate the credibility of their causal estimation results. Our method focuses on estimation by Double Machine Learning and extends previous work on sensitivity analysis based on Riesz Representation in cross-sectional settings. We establish asymptotic bounds for point estimates and confidence intervals in the canonical 2 × 2 setting and for group-time causal parameters in settings with staggered treatment adoption. Our approach makes it possible to relate the formulation of the parallel trends violation to empirical evidence from (1) pre-testing, (2) covariate benchmarking and (3) standard reporting statistics and visualizations. We provide extensive simulation experiments demonstrating the validity of our sensitivity approach and diagnostics and apply our approach to two empirical applications.
    Keywords: Sensitivity Analysis, Difference-in-differences, Double Machine Learning, Riesz Representation, Causal Inference
    Date: 2025
    URL: https://d.repec.org/n?u=RePEc:zbw:fubsbe:330188
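    The assumption at the center of this analysis, written out for the canonical 2 × 2 setting (the notation is illustrative, not necessarily the authors'): with potential untreated outcomes Y_t(0), treatment indicator D, and pre-treatment covariates X,
      \[
      \mathbb{E}\big[Y_{1}(0) - Y_{0}(0) \mid X, D = 1\big]
        = \mathbb{E}\big[Y_{1}(0) - Y_{0}(0) \mid X, D = 0\big],
      \]
      % under which the ATT is identified as
      \[
      \theta_{\mathrm{ATT}}
        = \mathbb{E}\big[Y_{1} - Y_{0} \mid D = 1\big]
        - \mathbb{E}\Big[\, \mathbb{E}\big[Y_{1} - Y_{0} \mid X, D = 0\big] \,\Big|\, D = 1 \Big].
      \]
    The sensitivity analysis then asks how far the ATT estimate and its confidence interval can move when an unobserved confounder makes the first equality fail by a bounded amount.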
  11. By: Jinkyu Kim; Hyunjung Yi; Mogan Gim; Donghee Choi; Jaewoo Kang
    Abstract: We propose DeepAries, a novel deep reinforcement learning framework for dynamic portfolio management that jointly optimizes the timing and allocation of rebalancing decisions. Unlike prior reinforcement learning methods that employ fixed rebalancing intervals regardless of market conditions, DeepAries adaptively selects optimal rebalancing intervals along with portfolio weights to reduce unnecessary transaction costs and maximize risk-adjusted returns. Our framework integrates a Transformer-based state encoder, which effectively captures complex long-term market dependencies, with Proximal Policy Optimization (PPO) to generate simultaneous discrete (rebalancing intervals) and continuous (asset allocations) actions. Extensive experiments on multiple real-world financial markets demonstrate that DeepAries significantly outperforms traditional fixed-frequency and full-rebalancing strategies in terms of risk-adjusted returns, transaction costs, and drawdowns. Additionally, we provide a live demo of DeepAries at https://deep-aries.github.io/, along with the source code and dataset at https://github.com/dmis-lab/DeepAries, illustrating DeepAries' capability to produce interpretable rebalancing and allocation decisions aligned with shifting market regimes. Overall, DeepAries introduces an innovative paradigm for adaptive and practical portfolio management by integrating both timing and allocation into a unified decision-making process.
    Date: 2025–09
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2510.14985
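    A hedged sketch of the joint action head the abstract describes: a Transformer encoder over recent market states feeding a categorical choice of rebalancing interval (discrete) and a softmax over portfolio weights (continuous). The sizes and interval grid are placeholders, and the PPO objective that trains both heads is omitted.
      import torch
      import torch.nn as nn

      class TimingAllocationHead(nn.Module):
          def __init__(self, n_assets=10, d_feat=16, d_model=64, intervals=(1, 5, 21)):
              super().__init__()
              self.embed = nn.Linear(d_feat, d_model)
              layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
              self.encoder = nn.TransformerEncoder(layer, num_layers=2)
              self.interval_logits = nn.Linear(d_model, len(intervals))  # discrete action
              self.weight_logits = nn.Linear(d_model, n_assets)          # continuous action
              self.intervals = intervals

          def forward(self, x):                       # x: (batch, lookback, d_feat)
              h = self.encoder(self.embed(x))[:, -1]  # last-step market summary
              interval_dist = torch.distributions.Categorical(logits=self.interval_logits(h))
              weights = torch.softmax(self.weight_logits(h), dim=-1)  # long-only simplex
              return interval_dist, weights

      head = TimingAllocationHead()
      dist, w = head(torch.rand(8, 30, 16))
      print("sampled interval index:", dist.sample()[:3].tolist(),
            "weights sum:", w.sum(dim=-1)[:3].tolist())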
  12. By: Florian Englmaier (LMU Munich); Jose E. Galdon-Sanchez (Universidad Publica de Navarra); Ricard Gil (IESE Business School); Michael Kaiser (E.CA Economics); Helene Strandt (LMU Munich)
    Abstract: This paper empirically examines how management practices affect firm productivity over the business cycle. Using high-dimensional plant-level survey data on human resource policies collected in Spain in 2006, we employ unsupervised machine learning to describe clusters of management practices (“management styles”). We establish a positive correlation between a management style associated with structured management and performance prior to the 2008 financial crisis. Interestingly, this correlation turns negative during the financial crisis and positive again in the economic recovery post-2013. Our evidence suggests that firms with more structured management are more likely to have practices fostering culture and intangible investments, such that they focus on long-run profitability and prioritize innovation over cost reduction, while facing higher adjustment costs in the short run through a higher share of fixed assets and lower employee turnover.
    Keywords: management practices; culture; unsupervised machine learning; productivity; great recession;
    JEL: M12 D22 C38
    Date: 2025–10–15
    URL: https://d.repec.org/n?u=RePEc:rco:dpaper:548
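    An illustrative stand-in for the clustering step: k-means on standardized survey items yields candidate "management styles" (the abstract specifies unsupervised learning but not the algorithm); the items and data are synthetic placeholders.
      import numpy as np
      import pandas as pd
      from sklearn.preprocessing import StandardScaler
      from sklearn.cluster import KMeans

      rng = np.random.default_rng(3)
      n_firms = 400
      survey = pd.DataFrame({
          "has_performance_reviews": rng.integers(0, 2, n_firms),
          "incentive_pay_share": rng.uniform(0, 0.5, n_firms),
          "training_days_per_year": rng.poisson(3, n_firms),
          "delegation_score": rng.uniform(0, 10, n_firms),
      })

      X = StandardScaler().fit_transform(survey)
      survey["style"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
      print(survey.groupby("style").mean().round(2))   # profile of each style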
  13. By: Joachim Wagner (Leuphana Universität Lüneburg, Institut für Volkswirtschaftslehre)
    Abstract: Firm characteristics in empirical models for margins of international trade usually enter these models in linear form. If non-linearities do matter and are ignored, this leads to biased results. Researchers, however, can never be sure that all possible non-linear relationships are taken care of. A solution is provided by Kernel Regularized Least Squares (KRLS), which uses a machine learning approach to learn the functional form from the data. While in earlier applications the big picture revealed by standard empirical models and KRLS was identical, this note presents a case where the results from a standard approach and from KRLS differ considerably.
    Keywords: Two-way trading firms, firm level data, BEEPS data, kernel regularized least squares (KRLS)
    JEL: F14
    Date: 2025–05
    URL: https://d.repec.org/n?u=RePEc:lue:wpaper:433
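    A minimal sketch contrasting a Probit with a kernel-based learner on a synthetic firm-level outcome; scikit-learn's KernelRidge with an RBF kernel stands in for KRLS here, and all variable names are placeholders.
      import numpy as np
      import pandas as pd
      import statsmodels.api as sm
      from sklearn.kernel_ridge import KernelRidge

      rng = np.random.default_rng(4)
      n = 2_000
      firms = pd.DataFrame({
          "log_employment": rng.normal(4, 1, n),
          "labor_productivity": rng.normal(0, 1, n),
          "foreign_owned": rng.integers(0, 2, n),
      })
      # Non-linear "true" propensity to be a two-way trader.
      latent = (0.8 * firms["log_employment"] - 3
                + 1.5 * np.sin(firms["labor_productivity"]) + 0.5 * firms["foreign_owned"])
      firms["two_way_trader"] = (latent + rng.normal(0, 1, n) > 0).astype(int)

      X, y = firms.drop(columns="two_way_trader"), firms["two_way_trader"]
      probit = sm.Probit(y, sm.add_constant(X)).fit(disp=0)
      print(probit.params.round(3))                    # linear-index coefficients

      krr = KernelRidge(kernel="rbf", alpha=1.0, gamma=0.5).fit(X, y)
      # Pointwise "marginal effect" of productivity via a finite difference.
      X_hi = X.copy(); X_hi["labor_productivity"] += 0.1
      effect = (krr.predict(X_hi) - krr.predict(X)) / 0.1
      print("mean effect of productivity:", effect.mean().round(3))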
  14. By: Meisenbacher, Stephen; Nestorov, Svetlozar; Norlander, Peter
    Abstract: Data from online job postings are difficult to access and are not built in a standard or transparent manner. Data included in the standard taxonomy and occupational information database (O*NET) are updated infrequently and based on small survey samples. We adopt O*NET as a framework for building natural language processing tools that extract structured information from job postings. We publish the Job Ad Analysis Toolkit (JAAT), a collection of open-source tools built for this purpose, and demonstrate its reliability and accuracy in out-of-sample and LLM-as-judge testing. We extract more than 10 billion data points from more than 155 million online job ads provided by the National Labor Exchange (NLx) Research Hub, including O*NET tasks, occupation codes, tools, and technologies, as well as wages, skills, industry, and other features. We describe the construction of a dataset of occupation-, state-, and industry-level features aggregated by monthly active jobs from 2015 to 2025. We illustrate the potential for research and future uses in education and workforce development.
    Keywords: Labor Market Information, Online Job Vacancies, NLP methods, ML, data transparency
    JEL: J23 J24 J63
    Date: 2025–10–01
    URL: https://d.repec.org/n?u=RePEc:pra:mprapa:126336
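    A sketch of one building block such a toolkit needs: matching free-text job ads to O*NET task statements via TF-IDF cosine similarity. The task list here is a tiny placeholder; the actual Job Ad Analysis Toolkit is the authors' open-source release.
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.metrics.pairwise import cosine_similarity

      onet_tasks = [
          "Analyze data to identify trends or relationships.",
          "Prepare financial reports and statements.",
          "Install, maintain, or repair electrical wiring and equipment.",
      ]
      job_ads = [
          "Seeking analyst to study customer data and report on emerging trends.",
          "Electrician needed to maintain wiring in commercial buildings.",
      ]

      vec = TfidfVectorizer(stop_words="english")
      matrix = vec.fit_transform(onet_tasks + job_ads)
      task_vecs, ad_vecs = matrix[: len(onet_tasks)], matrix[len(onet_tasks):]

      scores = cosine_similarity(ad_vecs, task_vecs)
      for ad, row in zip(job_ads, scores):
          best = row.argmax()
          print(f"{ad[:40]!r} -> task {best} (similarity {row[best]:.2f})")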
  15. By: Huben Liu; Dimitris Papanikolaou; Lawrence D.W. Schmidt; Bryan Seegmiller
    Abstract: We use recent advances in natural language processing and large language models to construct novel measures of technology exposure for workers that span almost two centuries. Combining our measures with Census data on occupation employment, we show that technological progress over the 20th century has led to economically meaningful shifts in labor demand across occupations: it has consistently increased demand for occupations with higher education requirements, occupations that pay higher wages, and occupations with a greater fraction of female workers. Using these insights and a calibrated model, we then explore different scenarios for how advances in artificial intelligence (AI) are likely to impact employment trends in the medium run. The model predicts a reversal of past trends, with AI favoring occupations that are lower-educated, lower-paid, and more male-dominated.
    JEL: J23 J24 N3 O3 O4
    Date: 2025–10
    URL: https://d.repec.org/n?u=RePEc:nbr:nberwo:34386
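    A hedged sketch of one way to build a text-based exposure measure: embed an innovation description and occupation task text with a sentence encoder and score exposure by cosine similarity. The encoder choice and the texts are illustrative, not the authors'.
      import numpy as np
      from sentence_transformers import SentenceTransformer

      model = SentenceTransformer("all-MiniLM-L6-v2")   # any sentence encoder works here

      patents = ["A machine for automatically setting type for printing."]
      occupation_tasks = {
          "Typesetter": "Arrange type by hand to compose pages for printing.",
          "Economist": "Study the production and distribution of goods and services.",
      }

      p_emb = model.encode(patents, normalize_embeddings=True)
      for occ, task in occupation_tasks.items():
          t_emb = model.encode([task], normalize_embeddings=True)
          exposure = float(p_emb[0] @ t_emb[0])         # cosine similarity
          print(f"{occ}: exposure = {exposure:.2f}")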
  16. By: Alexandra Piller (Study Center Gerzensee and University of Bern); Marc Schranz (University of Bern); Larissa Schwaller (University of Bern)
    Abstract: Identifying the causal effects of monetary policy is challenging due to the endogeneity of policy decisions. In recent years, high-frequency monetary policy surprises have become a popular identification strategy. To serve as a valid instrument, monetary policy surprises must be correlated with the true policy shock (relevant) while remaining uncorrelated with other shocks (exogenous). However, market-based monetary policy surprises around Federal Open Market Committee (FOMC) announcements often suffer from weak relevance and endogeneity concerns. This paper explores whether text analysis methods applied to central bank communication can help mitigate these concerns. We adopt two complementary approaches. First, to improve instrument relevance, we extend the dataset of monetary policy surprises from FOMC announcements to policy-relevant speeches by the Federal Reserve Board chair and vice chair. Second, using natural language processing techniques, we predict changes in market expectations from central bank communication, isolating the component of monetary policy surprises driven solely by communication. The resulting language-driven monetary policy surprises exhibit stronger instrument relevance, mitigate endogeneity concerns and produce impulse responses that align with standard macroeconomic theory.
    Date: 2025–10
    URL: https://d.repec.org/n?u=RePEc:szg:worpap:2505
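    A stylized sketch of the second step the abstract describes: predict the market-based surprise from the communication text and keep the fitted value as the language-driven component. The texts, outcomes, and learner are placeholders.
      import numpy as np
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.linear_model import Ridge
      from sklearn.pipeline import make_pipeline

      statements = [
          "The committee judges that further tightening may be appropriate.",
          "Policy will remain accommodative for an extended period.",
          "Inflation risks are tilted to the upside; rate increases are likely.",
          "The committee will be patient as it assesses incoming data.",
      ]
      market_surprise = np.array([0.06, -0.04, 0.08, -0.02])   # synthetic surprises

      text_model = make_pipeline(TfidfVectorizer(), Ridge(alpha=1.0))
      text_model.fit(statements, market_surprise)

      # Fitted values = the component of the surprise explained by language alone.
      language_driven = text_model.predict(statements)
      residual = market_surprise - language_driven
      print("language-driven:", language_driven.round(3), "residual:", residual.round(3))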
  17. By: Yuhan Cheng; Heyang Zhou; Yanchu Liu
    Abstract: We leverage the capacity of large language models such as the Generative Pre-trained Transformer (GPT) to construct factor models for Chinese futures markets. We successfully obtain 40 factors to design single-factor and multi-factor portfolios through long-short and long-only strategies, conducting backtests during the in-sample and out-of-sample periods. Comprehensive empirical analysis reveals that GPT-generated factors deliver remarkable Sharpe ratios and annualized returns while maintaining acceptable maximum drawdowns. Notably, the GPT-based factor models also achieve significant alphas over the IPCA benchmark. Moreover, these factors remain significant across extensive robustness tests, performing particularly well after the cutoff date of GPT's training data.
    Date: 2025–09
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2509.23609
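    A schematic backtest step for a single factor: rank assets by the factor score each period, go long the top quintile and short the bottom, and report an annualized Sharpe ratio on synthetic data. The GPT-generated factor definitions themselves are not reproduced here.
      import numpy as np
      import pandas as pd

      rng = np.random.default_rng(5)
      T, N = 250, 50                       # trading days x futures contracts
      factor = pd.DataFrame(rng.normal(size=(T, N)))
      returns = 0.02 * factor.shift(1) + rng.normal(0, 1, (T, N))   # weak synthetic signal

      def long_short(factor_row, q=0.2):
          ranks = factor_row.rank(pct=True)
          w = pd.Series(0.0, index=factor_row.index)
          w[ranks >= 1 - q] = 1.0          # long top quintile
          w[ranks <= q] = -1.0             # short bottom quintile
          return w / w.abs().sum()

      weights = factor.apply(long_short, axis=1).shift(1)   # trade on the lagged signal
      pnl = (weights * returns).sum(axis=1).iloc[2:]        # skip warm-up rows
      sharpe = np.sqrt(252) * pnl.mean() / pnl.std()
      print(f"annualized Sharpe: {sharpe:.2f}")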

This nep-big issue is ©2025 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at https://nep.repec.org. For comments please write to the director of NEP, Marco Novarese at <director@nep.repec.org>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.