nep-big New Economics Papers
on Big Data
Issue of 2023‒09‒11
sixteen papers chosen by
Tom Coupé, University of Canterbury

  1. Deep Learning from Implied Volatility Surfaces By Bryan T. Kelly; Boris Kuznetsov; Semyon Malamud; Teng Andrea Xu
  2. How Nations Become Fragile: An AI-Augmented Bird’s-Eye View (with a Case Study of South Sudan) By Tohid Atashbar
  3. Entity matching with similarity encoding: A supervised learning recommendation framework for linking (big) data By Karapanagiotis, Pantelis; Liebald, Marius
  4. DeRisk: An Effective Deep Learning Framework for Credit Risk Prediction over Real-World Financial Data By Yancheng Liang; Jiajie Zhang; Hui Li; Xiaochen Liu; Yi Hu; Yong Wu; Jinyao Zhang; Yongyan Liu; Yi Wu
  5. CEO Stress, Aging, and Death By Borgschulte, Mark; Guenzel, Marius; Liu, Canyao; Malmendier, Ulrike
  6. Quantifying Outlierness of Funds from their Categories using Supervised Similarity By Dhruv Desai; Ashmita Dhiman; Tushar Sharma; Deepika Sharma; Dhagash Mehta; Stefano Pasquali
  7. Understanding Models and Model Bias with Gaussian Processes By Thomas R. Cook; Nathan M. Palmer
  8. Optimizing B2B Product Offers with Machine Learning, Mixed Logit, and Nonlinear Programming By John V. Colias; Stella Park; Elizabeth Horn
  9. GEOWEALTH: spatial wealth inequality data for the United States, 1960-2020 By Suss, Joel; Kemeny, Thomas; Connor, Dylan Shane
  10. Graph Neural Networks for Forecasting Multivariate Realized Volatility with Spillover Effects By Chao Zhang; Xingyue Pu; Mihai Cucuringu; Xiaowen Dong
  11. Estimating the Impact of the Age of Criminal Majority: Decomposing Multiple Treatments in a Regression Discontinuity Framework By Michael G. Mueller-Smith; Benjamin Pyle; Caroline Walker
  12. Does Unfairness Hurt Women? The Effects of Losing Unfair Competitions By Stefano Piasenti; Marica Valente; Roel van Veldhuizen; Gregor Pfeifer
  13. A Novel Credit Model Risk Measure: does more data lead to lower model risk in credit scoring models? By Valter T. Yoshida Jr; Alan de Genaro; Rafael Schiozer; Toni R. E. dos Santos
  14. The Bayesian Context Trees State Space Model for time series modelling and forecasting By Ioannis Papageorgiou; Ioannis Kontoyiannis
  15. Forecasting oil prices with penalized regressions, variance risk premia and Google data By Fantazzini, Dean; Kurbatskii, Alexey; Mironenkov, Alexey; Lycheva, Maria
  16. Reinforcement Learning for Financial Index Tracking By Xianhua Peng; Chenyin Gong; Xue Dong He

  1. By: Bryan T. Kelly (Yale School of Management; AQR Capital Management; NBER); Boris Kuznetsov (Ecole Polytechnique Fédérale de Lausanne; Swiss Finance Institute); Semyon Malamud (Ecole Polytechnique Fédérale de Lausanne; Swiss Finance Institute; and CEPR); Teng Andrea Xu (Ecole Polytechnique Fédérale de Lausanne)
    Abstract: We develop a novel methodology for extracting information from option implied volatility (IV) surfaces for the cross-section of stock returns, using image recognition techniques from machine learning (ML). The predictive information we identify is essentially uncorrelated with most of the existing option-implied characteristics, delivers a higher Sharpe ratio, and has a significant alpha relative to a battery of standard and option-implied factors. We show the virtue of ensemble complexity: Best results are achieved with a large ensemble of ML models, with the out-of-sample performance increasing in the ensemble size, saturating when the number of model parameters significantly exceeds the number of observations. We introduce principal linear features, an analog of principal components for ML and use them to show IV feature complexity: A low-rank rotation of the IV surface cannot explain the model performance. Our results are robust to short-sale constraints and transaction costs.
    Date: 2023–08
  2. By: Tohid Atashbar
    Abstract: In this study we introduce and apply a set of machine learning and artificial intelligence techniques to analyze multi-dimensional fragility-related data. Our analysis of the fragility data collected by the OECD for its States of Fragility index showed that the use of such techniques could provide further insights into the non-linear relationships and diverse drivers of state fragility, highlighting the importance of a nuanced and context-specific approach to understanding and addressing this multi-aspect issue. We also applied the methodology used in this paper to South Sudan, one of the most fragile countries in the world, to analyze the dynamics behind the different aspects of fragility over time. The results could be used to improve the Fund’s country engagement strategy (CES) and efforts at the country.
    Keywords: Fragile and Conflict-Affected States; Fragility Trap; Fragility Syndrome; Machine Learning; Artificial Intelligence
    Date: 2023–08–11
  3. By: Karapanagiotis, Pantelis; Liebald, Marius
    Abstract: In this study, we introduce a novel entity matching (EM) framework. It combines state-of-the-art EM approaches based on Artificial Neural Networks (ANN) with a new similarity encoding derived from matching techniques that are prevalent in finance and economics. Our framework is on par with or outperforms alternative end-to-end frameworks in standard benchmark cases. Because similarity encoding is constructed using (edit) distances instead of semantic similarities, it avoids out-of-vocabulary problems when matching dirty data. We highlight this property by applying an EM application to dirty financial firm-level data extracted from historical archives.
    Keywords: Entity matching, Entity resolution, Database linking, Machine learning, Record resolution, Similarity encoding
    JEL: C8
    Date: 2023
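The edit-distance idea mentioned in the abstract above can be illustrated with a minimal sketch. This is not the authors' framework: the function names, the normalisation by the longer string, and the sample records are all assumptions made for illustration. It shows only how a pair of records might be turned into a vector of per-field similarities that a supervised model could then consume.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Map edit distance to [0, 1]; 1.0 means identical (assumed normalisation)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def encode_pair(left: dict, right: dict, fields: list) -> list:
    """Encode two records as one similarity score per field (hypothetical helper)."""
    return [similarity(left.get(f, "").lower(), right.get(f, "").lower())
            for f in fields]


# Hypothetical dirty records: a transposed pair of letters in the firm name.
record_a = {"name": "Deutsche Bank", "city": "Frankfurt"}
record_b = {"name": "Deutsche Bnak", "city": "Frankfurt"}
features = encode_pair(record_a, record_b, ["name", "city"])
```

Because the scores depend only on characters, a misspelled or previously unseen name still receives a meaningful similarity, which is the property the abstract attributes to distance-based encodings.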
  4. By: Yancheng Liang; Jiajie Zhang; Hui Li; Xiaochen Liu; Yi Hu; Yong Wu; Jinyao Zhang; Yongyan Liu; Yi Wu
    Abstract: Despite the tremendous advances achieved over the past years by deep learning techniques, the latest risk prediction models for industrial applications still rely on highly handtuned stage-wised statistical learning tools, such as gradient boosting and random forest methods. Different from images or languages, real-world financial data are high-dimensional, sparse, noisy and extremely imbalanced, which makes deep neural network models particularly challenging to train and fragile in practice. In this work, we propose DeRisk, an effective deep learning risk prediction framework for credit risk prediction on real-world financial data. DeRisk is the first deep risk prediction model that outperforms statistical learning approaches deployed in our company's production system. We also perform extensive ablation studies on our method to present the most critical factors for the empirical success of DeRisk.
    Date: 2023–08
  5. By: Borgschulte, Mark (University of Illinois at Urbana-Champaign); Guenzel, Marius (Wharton School, University of Pennsylvania); Liu, Canyao (Yale University); Malmendier, Ulrike (University of California, Berkeley)
    Abstract: We assess the long-term effects of managerial stress on aging and mortality. First, we show that exposure to industry distress shocks during the Great Recession produces visible signs of aging in CEOs. Applying neural-network based machine-learning techniques to pre- and post-distress pictures, we estimate an increase in so-called apparent age by one year. Second, using data on CEOs since the mid-1970s, we estimate a 1.1-year decrease in life expectancy after an industry distress shock, but a two-year increase when anti-takeover laws insulate CEOs from market discipline. The estimated health costs are significant, also relative to other known health risks.
    Keywords: managerial stress, life expectancy, apparent-age estimation, job demands, industry distress, visual machine-learning, corporate governance
    JEL: G34 I12 M12
    Date: 2023–08
  6. By: Dhruv Desai; Ashmita Dhiman; Tushar Sharma; Deepika Sharma; Dhagash Mehta; Stefano Pasquali
    Abstract: Mutual fund categorization has become a standard tool for the investment management industry and is extensively used by allocators for portfolio construction and manager selection, as well as by fund managers for peer analysis and competitive positioning. As a result, an (unintended) miscategorization or lack of precision can significantly impact allocation decisions and investment fund managers. Here, we aim to quantify the effect of miscategorization of funds utilizing a machine learning based approach. We formulate the problem of miscategorization of funds as a distance-based outlier detection problem, where the outliers are the data-points that are far from the rest of the data-points in the given feature space. We implement and employ a Random Forest (RF) based method of distance metric learning, and compute the so-called class-wise outlier measures for each data-point to identify outliers in the data. We test our implementation on various publicly available data sets, and then apply it to mutual fund data. We show that there is a strong relationship between the outlier measures of the funds and their future returns and discuss the implications of our findings.
    Date: 2023–08
  7. By: Thomas R. Cook; Nathan M. Palmer
    Abstract: Despite growing interest in the use of complex models, such as machine learning (ML) models, for credit underwriting, ML models are difficult to interpret, and it is possible for them to learn relationships that yield de facto discrimination. How can we understand the behavior and potential biases of these models, especially if our access to the underlying model is limited? We argue that counterfactual reasoning is ideal for interpreting model behavior, and that Gaussian processes (GP) can provide approximate counterfactual reasoning while also incorporating uncertainty in the underlying model’s functional form. We illustrate with an exercise in which a simulated lender uses a biased machine model to decide credit terms. Comparing aggregate outcomes does not clearly reveal bias, but with a GP model we can estimate individual counterfactual outcomes. This approach can detect the bias in the lending model even when only a relatively small sample is available. To demonstrate the value of this approach for the more general task of model interpretability, we also show how the GP model’s estimates can be aggregated to recreate the partial density functions for the lending model.
    Keywords: models; Gaussian process; model bias
    JEL: C10 C14 C18 C45
    Date: 2023–06–15
  8. By: John V. Colias (Decision Analyst); Stella Park (AT&T); Elizabeth Horn (Decision Analyst)
    Abstract: In B2B markets, value-based pricing and selling has become an important alternative to discounting. This study outlines a modeling method that uses customer data (product offers made to each current or potential customer, features, discounts, and customer purchase decisions) to estimate a mixed logit choice model. The model is estimated via hierarchical Bayes and machine learning, delivering customer-level parameter estimates. Customer-level estimates are input into a nonlinear programming next-offer maximization problem to select optimal features and discount level for customer segments, where segments are based on loyalty and discount elasticity. The mixed logit model is integrated with economic theory (the random utility model), and it predicts both customer perceived value for and response to alternative future sales offers. The methodology can be implemented to support value-based pricing and selling efforts. Contributions to the literature include: (a) the use of customer-level parameter estimates from a mixed logit model, delivered via a hierarchical Bayes estimation procedure, to support value-based pricing decisions; (b) validation that mixed logit customer-level modeling can deliver strong predictive accuracy, not as high as random forest but comparing favorably; and (c) a nonlinear programming problem that uses customer-level mixed logit estimates to select optimal features and discounts.
    Date: 2023–08
  9. By: Suss, Joel; Kemeny, Thomas; Connor, Dylan Shane
    Abstract: Wealth inequality has been sharply rising in the United States and across many other high-income countries. Due to a lack of data, we know little about how this trend has unfolded across locations within countries. Investigating this subnational geography of wealth is crucial, as from one generation to the next, wealth powerfully shapes opportunity and disadvantage across individuals and communities. Using machine-learning-based imputation to link newly assembled national historical surveys conducted by the U.S. Federal Reserve to population survey microdata, the data presented in this paper addresses this gap. The Geographic Wealth Inequality Database (“GEOWEALTH”) provides the first estimates of the level and distribution of wealth at various geographical scales within the United States from 1960 to 2020. The GEOWEALTH database enables new lines of investigation into the contribution of spatial wealth disparities to major societal challenges including wealth concentration, spatial income inequality, social mobility, housing unaffordability, and political polarization.
    JEL: J1 N0
    Date: 2023–08–01
  10. By: Chao Zhang; Xingyue Pu; Mihai Cucuringu; Xiaowen Dong
    Abstract: We present a novel methodology for modeling and forecasting multivariate realized volatilities using customized graph neural networks to incorporate spillover effects across stocks. The proposed model offers the benefits of incorporating spillover effects from multi-hop neighbors, capturing nonlinear relationships, and flexible training with different loss functions. Our empirical findings provide compelling evidence that incorporating spillover effects from multi-hop neighbors alone does not yield a clear advantage in terms of predictive accuracy. However, modeling nonlinear spillover effects enhances the forecasting accuracy of realized volatilities, particularly for short-term horizons of up to one week. Moreover, our results consistently indicate that training with the Quasi-likelihood loss leads to substantial improvements in model performance compared to the commonly-used mean squared error. A comprehensive series of empirical evaluations in alternative settings confirm the robustness of our results.
    Date: 2023–08
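To make the contrast between the two loss functions in the abstract above concrete, here is a small sketch. It assumes one common form of the quasi-likelihood (QLIKE) loss, actual/forecast - log(actual/forecast) - 1; whether this matches the paper's exact definition is an assumption, and the function names are hypothetical.

```python
import math


def mse(actual: list, forecast: list) -> float:
    """Mean squared error between realized and forecast volatilities."""
    return sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual)


def qlike(actual: list, forecast: list) -> float:
    """Average QLIKE loss in the assumed form ratio - log(ratio) - 1.

    Requires strictly positive inputs; the loss is zero only when the
    forecast equals the realized value.
    """
    total = 0.0
    for a, f in zip(actual, forecast):
        ratio = a / f
        total += ratio - math.log(ratio) - 1.0
    return total / len(actual)


# With a realized value of 1.0, compare a forecast that is half the truth
# with one that is double the truth.
under = qlike([1.0], [0.5])
over = qlike([1.0], [2.0])
```

In this form the loss depends only on the ratio of realized to forecast volatility, so it is insensitive to the level of volatility and penalises under-prediction more heavily than over-prediction by the same factor, whereas the squared error depends on absolute differences.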
  11. By: Michael G. Mueller-Smith; Benjamin Pyle; Caroline Walker
    Abstract: This paper studies the impact of adult prosecution on recidivism and employment trajectories for adolescent, first-time felony defendants. We use extensive linked Criminal Justice Administrative Record System and socio-economic data from Wayne County, Michigan (Detroit). Using the discrete age of majority rule and a regression discontinuity design, we find that adult prosecution reduces future criminal charges over 5 years by 0.48 felony cases (↓ 20%) while also worsening labor market outcomes: 0.76 fewer employers (↓ 19%) and $674 fewer earnings (↓ 21%) per year. We develop a novel econometric framework that combines standard regression discontinuity methods with predictive machine learning models to identify mechanism-specific treatment effects that underpin the overall impact of adult prosecution. We leverage these estimates to consider four policy counterfactuals: (1) raising the age of majority, (2) increasing adult dismissals to match the juvenile disposition rates, (3) eliminating adult incarceration, and (4) expanding juvenile record sealing opportunities to teenage adult defendants. All four scenarios generate positive returns for government budgets. When accounting for impacts to defendants as well as victim costs borne by society stemming from increases in recidivism, we find positive social returns for juvenile record sealing expansions and dismissing marginal adult charges; raising the age of majority breaks even. Eliminating prison for first-time adult felony defendants, however, increases net social costs. Policymakers may still find this attractive if they are willing to value beneficiaries (taxpayers and defendants) slightly higher (124%) than potential victims.
    JEL: C36 C45 J24 K14 K42
    Date: 2023–08
  12. By: Stefano Piasenti; Marica Valente; Roel van Veldhuizen; Gregor Pfeifer
    Abstract: How do men and women differ in their persistence after experiencing failure in a competitive environment? We tackle this question by combining a large online experiment (N=2,086) with machine learning. We find that when losing is unequivocally due to merit, both men and women exhibit a significant decrease in subsequent tournament entry. However, when the prior tournament is unfair, i.e., a loss is no longer necessarily based on merit, women are more discouraged than men. These results suggest that transparent meritocratic criteria may play a key role in preventing women from falling behind after experiencing a loss.
    Keywords: Competitiveness, Gender, Fairness, Machine learning, Online experiment
    JEL: C90 D91 J16 C14
    Date: 2023–11
  13. By: Valter T. Yoshida Jr; Alan de Genaro; Rafael Schiozer; Toni R. E. dos Santos
    Abstract: Large databases and Machine Learning have increased our ability to produce models with a different number of observations and explanatory variables. The credit scoring literature has focused on the optimization of classifications. Little attention has been paid to the inadequate use of models. This study fills this gap by focusing on model risk. It proposes a measure to assess credit scoring model risk. Its emphasis is on model misuse. The proposed model risk measure is ordinal, and it applies to many settings and types of loan portfolios, allowing comparisons of different specifications and situations (as in-sample or out-of-sample data). It allows practitioners and regulators to evaluate and compare different credit risk models in terms of model risk. We empirically test our measure in plugin LASSO default models and find that adding loans from different banks to increase the number of observations is not optimal, challenging the generally accepted assumption that more data leads to better predictions.
    Date: 2023–08
  14. By: Ioannis Papageorgiou; Ioannis Kontoyiannis
    Abstract: A hierarchical Bayesian framework is introduced for developing rich mixture models for real-valued time series, along with a collection of effective tools for learning and inference. At the top level, meaningful discrete states are identified as appropriately quantised values of some of the most recent samples. This collection of observable states is described as a discrete context-tree model. Then, at the bottom level, a different, arbitrary model for real-valued time series - a base model - is associated with each state. This defines a very general framework that can be used in conjunction with any existing model class to build flexible and interpretable mixture models. We call this the Bayesian Context Trees State Space Model, or the BCT-X framework. Efficient algorithms are introduced that allow for effective, exact Bayesian inference; in particular, the maximum a posteriori probability (MAP) context-tree model can be identified. These algorithms can be updated sequentially, facilitating efficient online forecasting. The utility of the general framework is illustrated in two particular instances: When autoregressive (AR) models are used as base models, resulting in a nonlinear AR mixture model, and when conditional heteroscedastic (ARCH) models are used, resulting in a mixture model that offers a powerful and systematic way of modelling the well-known volatility asymmetries in financial data. In forecasting, the BCT-X methods are found to outperform state-of-the-art techniques on simulated and real-world data, both in terms of accuracy and computational requirements. In modelling, the BCT-X structure finds natural structure present in the data. In particular, the BCT-ARCH model reveals a novel, important feature of stock market index data, in the form of an enhanced leverage effect.
    Date: 2023–08
  15. By: Fantazzini, Dean; Kurbatskii, Alexey; Mironenkov, Alexey; Lycheva, Maria
    Abstract: This paper investigates whether augmenting models with the variance risk premium (VRP) and Google search data improves the quality of the forecasts for real oil prices. We considered a time sample of monthly data from 2007 to 2019 that includes several episodes of high volatility in the oil market. Our evidence shows that penalized regressions provided the best forecasting performances across most of the forecasting horizons. Moreover, we found that models using the VRP as an additional predictor performed best for forecasts up to 6-12 months ahead, while models using Google data as an additional predictor performed better for longer-term forecasts up to 12-24 months ahead. However, we found that the differences in forecasting performances were not statistically different for most models, and only the Principal Component Regression (PCR) and the Partial least squares (PLS) regression were consistently excluded from the set of best forecasting models. These results also held after a set of robustness checks that considered model specifications using a wider set of influential variables, a Hierarchical Vector Auto-Regression model estimated with the LASSO, and a set of forecasting models using a simplified specification for Google Trends data.
    Keywords: Oil price; Variance Risk Premium; Google Trends; VAR; LASSO; Ridge; Elastic Net; Principal components; Partial least squares
    JEL: C22 C32 C52 C53 C55 C58 G17 O13 Q47
    Date: 2022
  16. By: Xianhua Peng; Chenyin Gong; Xue Dong He
    Abstract: We propose the first discrete-time infinite-horizon dynamic formulation of the financial index tracking problem under both return-based tracking error and value-based tracking error. The formulation overcomes the limitations of existing models by incorporating the intertemporal dynamics of market information variables not limited to prices, allowing exact calculation of transaction costs, accounting for the tradeoff between overall tracking error and transaction costs, allowing effective use of data in a long time period, etc. The formulation also allows novel decision variables of cash injection or withdrawal. We propose to solve the portfolio rebalancing equation using a Banach fixed point iteration, which allows accurate calculation of transaction costs specified as nonlinear functions of trading volumes in practice. We propose an extension of a deep reinforcement learning (RL) method to solve the dynamic formulation. Our RL method resolves the issue of data limitation resulting from the availability of a single sample path of financial data by a novel training scheme. A comprehensive empirical study based on a 17-year-long testing set demonstrates that the proposed method outperforms a benchmark method in terms of tracking accuracy and has the potential for earning extra profit through a cash withdrawal strategy.
    Date: 2023–08
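The fixed-point idea mentioned in the abstract above can be sketched on a toy rebalancing problem. Everything here is an assumption made for illustration and not the paper's formulation: the post-trade portfolio value V is taken to solve V = W - sum_i c(|w_i V - h_i|), where W is the pre-trade value, h_i are current dollar holdings, w_i are target weights, and c is a hypothetical nonlinear cost function. Because costs are small relative to trade sizes, the map on the right-hand side is a contraction and simple iteration converges.

```python
def trade_cost(volume: float, rate: float = 0.001, impact: float = 0.0001) -> float:
    """Hypothetical nonlinear cost: proportional fee plus a market-impact term."""
    return rate * volume + impact * volume ** 1.5


def total_cost(value: float, holdings: list, target_weights: list) -> float:
    """Cost of trading from current holdings to target weights at a given value."""
    return sum(trade_cost(abs(w * value - h))
               for w, h in zip(target_weights, holdings))


def rebalance(holdings: list, target_weights: list,
              tol: float = 1e-12, max_iter: int = 100) -> float:
    """Iterate V <- W - cost(V) until successive values agree within tol."""
    wealth_before = sum(holdings)
    value = wealth_before  # initial guess: no transaction costs
    for _ in range(max_iter):
        new_value = wealth_before - total_cost(value, holdings, target_weights)
        if abs(new_value - value) < tol:
            return new_value
        value = new_value
    raise RuntimeError("fixed-point iteration did not converge")


# Hypothetical two-asset portfolio moved from 60/40 to 50/50.
holdings = [60.0, 40.0]
weights = [0.5, 0.5]
post_trade_value = rebalance(holdings, weights)
```

The returned value is self-consistent: the trades needed to reach the target weights at that value cost exactly the difference between the pre-trade and post-trade portfolio values, up to the tolerance.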

This nep-big issue is ©2023 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at For comments please write to the director of NEP, Marco Novarese at <>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.