nep-big New Economics Papers
on Big Data
Issue of 2024‒07‒08
fifteen papers chosen by
Tom Coupé, University of Canterbury


  1. Distributional Refinement Network: Distributional Forecasting via Deep Learning By Benjamin Avanzi; Eric Dong; Patrick J. Laub; Bernard Wong
  2. Machine Learning Methods for Pricing Financial Derivatives By Lei Fan; Justin Sirignano
  3. How Inductive Bias in Machine Learning Aligns with Optimality in Economic Dynamics By Mahdi Ebrahimi Kahou; James Yu; Jesse Perla; Geoff Pleiss
  4. Algorithmic Collusion in Dynamic Pricing with Deep Reinforcement Learning By Shidi Deng; Maximilian Schiffer; Martin Bichler
  5. Using Cross-Survey Imputation to Estimate Poverty for Venezuelan Refugees in Colombia By Sarr, Ibrahima; Dang, Hai-Anh; Gutierrez, Carlos Santiago Guzman; Beltramo, Theresa; Verme, Paolo
  6. Unemployment Insurance Fraud in the Debit Card Market By Umang Khetan; Jetson Leder-Luis; Jialan Wang; Yunrong Zhou
  7. Modèles internes des banques pour le calcul du capital réglementaire (IRB) et intelligence artificielle [Banks' internal models for computing regulatory capital (IRB) and artificial intelligence] By Henri Fraisse; Christophe Hurlin
  8. Distributional impacts of climate policy and effective compensation: Evidence from 88 countries By Missbach, Leonard; Steckel, Jan Christoph
  9. Low-dimensional approximations of the conditional law of Volterra processes: a non-positive curvature approach By Reza Arabpour; John Armstrong; Luca Galimberti; Anastasis Kratsios; Giulia Livieri
  10. A K-means Algorithm for Financial Market Risk Forecasting By Jinxin Xu; Kaixian Xu; Yue Wang; Qinyan Shen; Ruisi Li
  11. Influencer Cartels By Marit Hinnosaar; Toomas Hinnosaar
  12. The Accuracy of Domain Specific and Descriptive Analysis Generated by Large Language Models By Denish Omondi Otieno; Faranak Abri; Sima Siami-Namini; Akbar Siami Namin
  13. Regulatory Compliance with Limited Enforceability: Evidence from Privacy Policies By Bernhard Ganglmair; Julia Krämer; Jacopo Gambato
  14. Predicting Police Misconduct By Greg Stoddard; Dylan J. Fitzpatrick; Jens Ludwig
  15. Confinement policies: controlling contagion without compromising mental health By Ariadna García-Prado; Paula González; Yolanda F. Rebollo-Sanz

  1. By: Benjamin Avanzi; Eric Dong; Patrick J. Laub; Bernard Wong
    Abstract: A key task in actuarial modelling involves modelling the distributional properties of losses. Classic (distributional) regression approaches like Generalized Linear Models (GLMs; Nelder and Wedderburn, 1972) are commonly used, but challenges remain in developing models that can (i) allow covariates to flexibly impact different aspects of the conditional distribution, (ii) integrate developments in machine learning and AI to maximise predictive power while achieving (i), and (iii) maintain a level of interpretability that enhances trust in the model and its outputs, which is often compromised in efforts pursuing (i) and (ii). We tackle this problem by proposing a Distributional Refinement Network (DRN), which combines an inherently interpretable baseline model (such as a GLM) with a flexible neural network, namely a modified Deep Distribution Regression (DDR; Li et al., 2019) method. Inspired by the Combined Actuarial Neural Network (CANN; Schelldorfer and Wüthrich, 2019), our approach flexibly refines the entire baseline distribution. As a result, the DRN captures varying effects of features across all quantiles, improving predictive performance while maintaining adequate interpretability. Using both synthetic and real-world data, we demonstrate the DRN's superior distributional forecasting capacity. The DRN has the potential to be a powerful distributional regression model in actuarial science and beyond.
    Date: 2024–06
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2406.00998&r=
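    Sketch: a toy illustration of the refinement idea, not the authors' DRN implementation. A fixed gamma density stands in for a fitted GLM baseline, and a small network adds logits to the baseline log-density on a discretised loss grid; the grid, network sizes, and training settings are illustrative assumptions.

      import torch
      import torch.nn as nn

      torch.manual_seed(0)
      n, n_bins = 2000, 50
      grid = torch.linspace(0.1, 10.0, n_bins)          # discretised loss support (assumed)
      x = torch.rand(n, 3)                              # synthetic covariates
      y = torch.distributions.Gamma(2.0, 1.0 + x[:, 0]).sample()  # synthetic losses

      # Baseline "GLM": a fixed gamma density on the grid, standing in for a
      # fitted GLM (in practice this comes from a separate estimation step).
      log_base = torch.distributions.Gamma(2.0, 1.5).log_prob(grid)

      refiner = nn.Sequential(nn.Linear(3, 32), nn.ReLU(), nn.Linear(32, n_bins))
      opt = torch.optim.Adam(refiner.parameters(), lr=1e-2)

      step = (10.0 - 0.1) / (n_bins - 1)
      bin_idx = ((y - 0.1) / step).long().clamp(0, n_bins - 1)  # bin of each loss
      for _ in range(300):
          logits = log_base + refiner(x)                # refine the whole baseline
          log_p = logits.log_softmax(dim=1)             # refined discrete density
          loss = -log_p[torch.arange(n), bin_idx].mean()  # negative log-likelihood
          opt.zero_grad(); loss.backward(); opt.step()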
  2. By: Lei Fan; Justin Sirignano
    Abstract: Stochastic differential equation (SDE) models are the foundation for pricing and hedging financial derivatives. The drift and volatility functions in SDE models are typically chosen to be algebraic functions with a small number (fewer than five) of parameters which can be calibrated to market data. A more flexible approach is to use neural networks to model the drift and volatility functions, which provides more degrees of freedom to match observed market data. Training such models requires optimizing over an SDE, which is computationally challenging. For European options, we develop a fast stochastic gradient descent (SGD) algorithm for training the neural network-SDE model. Our SGD algorithm uses two independent SDE paths to obtain an unbiased estimate of the direction of steepest descent. For American options, we optimize over the corresponding Kolmogorov partial differential equation (PDE). The neural network appears as coefficient functions in the PDE. Models are trained on large datasets (many contracts), requiring either large simulations (many Monte Carlo samples for the stock price paths) or large numbers of PDEs (a PDE must be solved for each contract). Numerical results are presented for real market data including S&P 500 index options, S&P 100 index options, and single-stock American options. The neural-network-based SDE models are compared against the Black-Scholes model, Dupire's local volatility model, and the Heston model. Models are evaluated in terms of how accurately they price out-of-sample financial derivatives, which is a core task in derivative pricing at financial institutions.
    Date: 2024–06
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2406.00459&r=
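    Sketch: a minimal version of the two-path gradient trick described above, under assumed toy settings (a single call contract, invented hyperparameters), not the authors' code. For squared pricing error (E[payoff] - market)^2, multiplying the detached error from one batch of paths by the mean payoff of an independent batch gives an unbiased gradient estimate.

      import torch
      import torch.nn as nn

      torch.manual_seed(0)
      net = nn.Sequential(nn.Linear(2, 32), nn.Tanh(), nn.Linear(32, 2))  # drift, log-vol
      opt = torch.optim.Adam(net.parameters(), lr=1e-3)
      S0, K, T, market_price, steps, batch = 1.0, 1.0, 1.0, 0.08, 50, 256
      dt = T / steps

      def simulate():
          """Euler-Maruyama paths of the neural SDE; returns call payoffs."""
          s = torch.full((batch, 1), S0)
          for i in range(steps):
              t = torch.full((batch, 1), i * dt)
              out = net(torch.cat([t, s], dim=1))
              mu, sigma = out[:, :1], out[:, 1:].exp()
              s = s + mu * s * dt + sigma * s * dt ** 0.5 * torch.randn(batch, 1)
          return torch.relu(s - K)                      # European call payoff

      for _ in range(200):
          err = simulate().mean().detach() - market_price  # path set 1, no gradient
          surrogate = 2.0 * err * simulate().mean()        # independent path set 2
          opt.zero_grad(); surrogate.backward(); opt.step()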
  3. By: Mahdi Ebrahimi Kahou; James Yu; Jesse Perla; Geoff Pleiss
    Abstract: This paper examines the alignment of inductive biases in machine learning (ML) with structural models of economic dynamics. Unlike dynamical systems found in the physical and life sciences, economic models are often specified by differential equations with a mixture of easy-to-enforce initial conditions and hard-to-enforce infinite-horizon boundary conditions (e.g. transversality and no-Ponzi-scheme conditions). Traditional methods for enforcing these constraints are computationally expensive and unstable. We investigate algorithms where those infinite-horizon constraints are ignored, simply training unregularized kernel machines and neural networks to obey the differential equations. Despite the inherent underspecification of this approach, our findings reveal that the inductive biases of these ML models innately enforce the infinite-horizon conditions necessary for well-posedness. We theoretically demonstrate that (approximate or exact) min-norm ML solutions to interpolation problems satisfy these infinite-horizon boundary conditions in a wide class of problems. We then provide empirical evidence that deep learning and ridgeless kernel methods are not only theoretically sound with respect to economic assumptions, but may even dominate classic algorithms in low to medium dimensions. More importantly, these results give confidence that, despite solving seemingly ill-posed problems, there are reasons to trust the plethora of black-box ML algorithms used by economists to solve previously intractable, high-dimensional dynamical systems, paving the way for future work on estimation of inverse problems with embedded optimal control problems.
    Date: 2024–06
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2406.01898&r=
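    Sketch: a toy construction (mine, not the paper's code) of the min-norm idea. We solve f'(t) + f(t) = 0 with f(0) = 1, whose solution is exp(-t), by writing f as an RBF kernel expansion and handing the underdetermined collocation system to numpy's least squares, which returns the minimum-norm coefficient vector; no terminal boundary condition is imposed.

      import numpy as np

      l = 0.5                                           # kernel length scale (assumed)
      centers = np.linspace(0.0, 3.0, 30)
      colloc = np.linspace(0.0, 3.0, 25)                # fewer equations than unknowns

      def k(t, s):                                      # RBF kernel matrix
          return np.exp(-(t[:, None] - s[None, :]) ** 2 / (2 * l ** 2))

      def dk(t, s):                                     # d/dt of the kernel
          return -(t[:, None] - s[None, :]) / l ** 2 * k(t, s)

      A = np.vstack([dk(colloc, centers) + k(colloc, centers),  # rows: f' + f = 0
                     k(np.array([0.0]), centers)])              # row: f(0) = 1
      b = np.concatenate([np.zeros(len(colloc)), [1.0]])
      c, *_ = np.linalg.lstsq(A, b, rcond=None)         # min-norm solution

      t = np.linspace(0.0, 3.0, 7)
      print(np.c_[t, k(t, centers) @ c, np.exp(-t)])    # kernel solution vs. exact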
  4. By: Shidi Deng; Maximilian Schiffer; Martin Bichler
    Abstract: Nowadays, a significant share of the business-to-consumer sector is based on online platforms like Amazon and Alibaba, which use Artificial Intelligence for pricing strategies. This has sparked debate on whether pricing algorithms may tacitly collude to set supra-competitive prices without being explicitly designed to do so. Our study addresses these concerns by examining the risk of collusion when Reinforcement Learning algorithms are used to decide on pricing strategies in competitive markets. Prior research in this field focused on Tabular Q-learning (TQL) and led to opposing views on whether learning-based algorithms can lead to supra-competitive prices. Our work contributes to this ongoing discussion with a more nuanced numerical study that goes beyond TQL, additionally capturing off- and on-policy Deep Reinforcement Learning (DRL) algorithms. We study multiple Bertrand oligopoly variants and show that algorithmic collusion depends on the algorithm used. In our experiments, TQL exhibits more collusion and greater price dispersion than DRL algorithms. We show that the severity of collusion depends not only on the algorithm used but also on the characteristics of the market environment. We further find that Proximal Policy Optimization appears to be less prone to collusive outcomes than other state-of-the-art DRL algorithms.
    Date: 2024–06
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2406.02437&r=
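    Sketch: a compact Bertrand duopoly with two tabular Q-learners, in the spirit of the TQL baselines discussed above; the demand system, price grid, and learning parameters are illustrative assumptions, not the paper's experimental setup.

      import numpy as np

      rng = np.random.default_rng(0)
      prices = np.linspace(1.0, 2.0, 10)                # discrete price grid
      nP, alpha, eps, gamma, cost = len(prices), 0.1, 0.05, 0.95, 1.0

      def profits(i, j):
          """Logit demand with an outside option (softer than pure Bertrand)."""
          d = np.exp(-prices[[i, j]] / 0.25)
          share = d / (d.sum() + np.exp(-2.0 / 0.25))
          return (prices[[i, j]] - cost) * share

      Q = np.zeros((2, nP, nP, nP))                     # agent, own last, rival last, action
      s = (0, 0)                                        # state: last period's prices
      for _ in range(100_000):
          a = [int(rng.integers(nP)) if rng.random() < eps
               else int(Q[k, s[k], s[1 - k]].argmax()) for k in range(2)]
          r = profits(*a)
          for k in range(2):
              target = r[k] + gamma * Q[k, a[k], a[1 - k]].max()
              Q[k, s[k], s[1 - k], a[k]] += alpha * (target - Q[k, s[k], s[1 - k], a[k]])
          s = (a[0], a[1])

      print("long-run prices:", prices[s[0]], prices[s[1]])  # above cost = collusive markup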
  5. By: Sarr, Ibrahima (United Nations High Commissioner for Refugees); Dang, Hai-Anh (World Bank); Gutierrez, Carlos Santiago Guzman (University of Oxford); Beltramo, Theresa (United Nations High Commissioner for Refugees); Verme, Paolo (World Bank)
    Abstract: Household consumption or income surveys do not typically cover refugee populations. In the rare cases where refugees are included, inconsistencies between different data sources can interfere with comparable poverty estimates. We test the performance of a recently developed cross-survey imputation method to estimate poverty for a sample of refugees in Colombia, combining household income surveys collected by the Government of Colombia and administrative data collected by the United Nations High Commissioner for Refugees. We find that certain variable transformation methods can help resolve these inconsistencies. Estimation results with our preferred variable standardization method are robust to different imputation methods, including the normal linear regression method, the empirical distribution of the errors method, and the probit and logit methods. We also employ several common machine learning techniques such as Random Forest, Lasso, Ridge, and elastic net regressions for robustness checks, but these techniques generally perform worse than the imputation methods that we use. We also find that we can reasonably impute poverty rates using an older household income survey and a more recent ProGres dataset for most of the poverty lines. These results provide relevant inputs into designing better surveys and administrative datasets on refugees in various country settings.
    Keywords: refugees, poverty, imputation, Colombia
    JEL: C15 F22 I32 O15 O20
    Date: 2024–05
    URL: https://d.repec.org/n?u=RePEc:iza:izadps:dp17036&r=
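    Sketch: a stylised run of the normal linear regression imputation method named above, on synthetic data with an invented poverty line. Log income is fitted on covariates shared by both datasets in the "survey", then imputed with drawn normal errors into the "administrative" records.

      import numpy as np

      rng = np.random.default_rng(1)
      n_survey, n_admin, poverty_line = 3000, 5000, np.log(2.0)

      def covariates(n):                                # shared, harmonised variables
          return np.c_[np.ones(n), rng.integers(1, 9, n),     # household size
                       rng.integers(0, 2, n),                 # urban dummy
                       rng.uniform(20, 70, n) / 10]           # head's age / 10

      X_s, X_a = covariates(n_survey), covariates(n_admin)
      beta_true = np.array([0.5, -0.15, 0.3, 0.2])
      y_s = X_s @ beta_true + rng.normal(0, 0.6, n_survey)    # log income in survey

      beta, *_ = np.linalg.lstsq(X_s, y_s, rcond=None)
      sigma = np.std(y_s - X_s @ beta, ddof=X_s.shape[1])

      draws = [(X_a @ beta + rng.normal(0, sigma, n_admin) < poverty_line).mean()
               for _ in range(100)]                     # repeated imputation draws
      print(f"imputed poverty rate: {np.mean(draws):.3f} (sd {np.std(draws):.3f})")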
  6. By: Umang Khetan; Jetson Leder-Luis; Jialan Wang; Yunrong Zhou
    Abstract: We study fraud in the unemployment insurance (UI) system using a dataset of 35 million debit card transactions. We apply machine learning techniques to cluster cards corresponding to varying levels of suspicious or potentially fraudulent activity. We then conduct a difference-in-differences analysis based on the staggered adoption of state-level identity verification systems between 2020 and 2021 to assess the effectiveness of screening for reducing fraud. Our findings suggest that identity verification reduced payouts to suspicious cards by 27%, while non-suspicious cards were largely unaffected by these technologies. Our results indicate that identity screening may be an effective mechanism for mitigating fraud in the UI system and for benefits programs more broadly.
    JEL: G51 H53 J65 K42
    Date: 2024–05
    URL: https://d.repec.org/n?u=RePEc:nbr:nberwo:32527&r=
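    Sketch: a bare-bones version of the staggered difference-in-differences design described above, on a simulated state-month panel with an assumed treatment effect of -27; a two-way fixed-effects regression with state-clustered errors recovers it.

      import numpy as np
      import pandas as pd
      import statsmodels.formula.api as smf

      rng = np.random.default_rng(2)
      states, months = 20, 24
      adopt = rng.integers(8, 20, states)               # staggered adoption month
      rows = []
      for s in range(states):
          for t in range(months):
              treated = int(t >= adopt[s])
              payout = 100 + 5 * s + 2 * t - 27 * treated + rng.normal(0, 5)
              rows.append((s, t, treated, payout))
      df = pd.DataFrame(rows, columns=["state", "month", "treated", "payout"])

      twfe = smf.ols("payout ~ treated + C(state) + C(month)", data=df).fit(
          cov_type="cluster", cov_kwds={"groups": df["state"]})
      print(twfe.params["treated"])                     # estimate near -27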
  7. By: Henri Fraisse; Christophe Hurlin
    Abstract: This note outlines the issues, risks and benefits of machine learning (ML) models for the design of the internal credit risk assessment models used by banking institutions to calculate their own funds requirement (the "Credit IRB Approach"). The use of ML models in IRB models is currently marginal. It could, however, improve the predictive quality of the models and in some cases lead to a reduction in capital requirements. Yet ML models face a lack of interpretability that surrogate ("substitution") or local approximation methods do not resolve.
    Keywords: Machine Learning; banking prudential regulation; internal models; regulatory capital
    JEL: G21 G29 C10 C38 C55
    Date: 2024
    URL: https://d.repec.org/n?u=RePEc:bfr:decfin:44&r=
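    Sketch: one concrete instance of the "substitution" (global surrogate) approach to interpretability the note mentions: approximate an opaque credit-risk model with a shallow decision tree and measure how faithful the surrogate is. Data and model choices are placeholders.

      import numpy as np
      from sklearn.datasets import make_classification
      from sklearn.ensemble import GradientBoostingClassifier
      from sklearn.tree import DecisionTreeRegressor
      from sklearn.metrics import r2_score

      X, y = make_classification(n_samples=5000, n_features=10, random_state=0)
      black_box = GradientBoostingClassifier(random_state=0).fit(X, y)
      pd_hat = black_box.predict_proba(X)[:, 1]         # opaque model's default probabilities

      surrogate = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, pd_hat)
      fidelity = r2_score(pd_hat, surrogate.predict(X))
      print(f"surrogate fidelity R^2: {fidelity:.2f}")  # low fidelity = interpretability gap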
  8. By: Missbach, Leonard; Steckel, Jan Christoph
    Abstract: We analyze the distributional impacts of climate policy by examining heterogeneity in households' carbon intensity of consumption. We construct a novel dataset that includes information on the carbon intensity of 1.5 million individual households from 88 countries. We first show that horizontal differences are generally larger than vertical differences. We then use supervised machine learning to analyze the non-linear contribution of household characteristics to the prediction of carbon intensity of consumption. Including household-level information beyond total household expenditures, such as information on vehicle ownership, location, and energy use, increases the accuracy of predicting households' carbon intensity. The importance of such features is country-specific and model accuracy varies across the sample. We identify six clusters of countries that differ in the distribution of climate policy costs and their determinants. Our results highlight that, depending on the context, some compensation policies may be more effective in reducing horizontal heterogeneity than others.
    Keywords: Climate policy, Distributional impacts, Inequality, Transfers
    JEL: C38 C55 D30 H23 Q56
    Date: 2024
    URL: https://d.repec.org/n?u=RePEc:zbw:esprep:296491&r=
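    Sketch: an illustrative version of the supervised-ML step on synthetic households with invented feature names: predict carbon intensity of consumption and ask, via permutation importance, which characteristics matter beyond total expenditure.

      import numpy as np
      import pandas as pd
      from sklearn.ensemble import RandomForestRegressor
      from sklearn.inspection import permutation_importance
      from sklearn.model_selection import train_test_split

      rng = np.random.default_rng(3)
      n = 5000
      df = pd.DataFrame({"expenditure": rng.lognormal(8, 0.5, n),
                         "owns_vehicle": rng.integers(0, 2, n),
                         "urban": rng.integers(0, 2, n),
                         "uses_solid_fuels": rng.integers(0, 2, n)})
      carbon = (0.2 * np.log(df.expenditure) + 0.8 * df.owns_vehicle
                - 0.3 * df.urban + 0.9 * df.uses_solid_fuels + rng.normal(0, 0.3, n))

      X_tr, X_te, y_tr, y_te = train_test_split(df, carbon, random_state=0)
      rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
      imp = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
      for name, m in zip(df.columns, imp.importances_mean):
          print(f"{name:>18}: {m:.3f}")                 # features beyond expenditure matter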
  9. By: Reza Arabpour; John Armstrong; Luca Galimberti; Anastasis Kratsios; Giulia Livieri
    Abstract: Predicting the conditional evolution of Volterra processes with stochastic volatility is a crucial challenge in mathematical finance. While deep neural network models offer promise in approximating the conditional law of such processes, their effectiveness is hindered by a curse of dimensionality caused by the infinite dimensionality and non-smooth nature of these problems. To address this, we propose a two-step solution. First, we develop a stable dimension reduction technique, projecting the law of a reasonably broad class of Volterra processes onto a low-dimensional statistical manifold of non-positive sectional curvature. Next, we introduce a sequential deep learning model tailored to the manifold's geometry, which we show can approximate the projected conditional law of the Volterra process. Our model leverages an auxiliary hypernetwork to dynamically update its internal parameters, allowing it to encode non-stationary dynamics of the Volterra process, and it can be interpreted as a gating mechanism in a mixture-of-experts model where each expert is specialized at a specific point in time. Our hypernetwork further allows us to achieve approximation rates that would seemingly only be possible with very large networks.
    Date: 2024–05
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2405.20094&r=
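    Sketch: the hypernetwork idea in miniature (sizes and setup are assumptions, not the paper's architecture): an auxiliary network maps time t to the weights of a small target layer, so the target's parameters, not just its inputs, change over time.

      import torch
      import torch.nn as nn

      d_in, d_out, hidden = 4, 2, 16
      n_w = d_in * d_out + d_out                        # target weights + biases
      hyper = nn.Sequential(nn.Linear(1, hidden), nn.Tanh(), nn.Linear(hidden, n_w))

      def target_forward(x, t):
          """Apply a linear layer whose parameters are emitted by the hypernetwork."""
          w = hyper(t)                                  # t: (batch, 1) time inputs
          W = w[:, :d_in * d_out].reshape(-1, d_out, d_in)
          b = w[:, d_in * d_out:]
          return torch.bmm(W, x.unsqueeze(-1)).squeeze(-1) + b

      x = torch.randn(8, d_in)
      print(target_forward(x, torch.full((8, 1), 0.3)).shape)  # torch.Size([8, 2])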
  10. By: Jinxin Xu; Kaixian Xu; Yue Wang; Qinyan Shen; Ruisi Li
    Abstract: Financial market risk forecasting involves applying mathematical models, historical data analysis and statistical methods to estimate the impact of future market movements on investments. This process is crucial for investors developing strategies, financial institutions managing assets and regulators formulating policy. Existing approaches to financial market risk prediction suffer from high error rates and low precision. The K-means algorithm from machine learning is an effective risk prediction technique for financial markets. This study uses the K-means algorithm to develop a financial market risk prediction system, which significantly improves the accuracy and efficiency of financial market risk prediction. Ultimately, the outcomes of the experiments confirm that the K-means algorithm operates with user-friendly simplicity and achieves a 94.61% accuracy rate.
    Date: 2024–05
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2405.13076&r=
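    Sketch: a plain K-means pass over simulated return series with arbitrary risk features and cluster count; it illustrates grouping assets into risk regimes but does not reproduce the paper's pipeline or its 94.61% figure.

      import numpy as np
      from sklearn.cluster import KMeans
      from sklearn.preprocessing import StandardScaler

      rng = np.random.default_rng(4)
      vols = rng.uniform(0.005, 0.04, 100)[:, None]     # per-asset volatility
      returns = rng.normal(0, vols, (100, 250))         # 100 assets, 250 days

      features = np.c_[returns.std(axis=1),             # realised volatility
                       returns.min(axis=1),             # worst daily loss
                       np.abs(returns).mean(axis=1)]    # mean absolute move
      X = StandardScaler().fit_transform(features)

      labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
      for c in range(3):
          print(f"cluster {c}: mean vol {features[labels == c, 0].mean():.4f}")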
  11. By: Marit Hinnosaar; Toomas Hinnosaar
    Abstract: Social media influencers account for a growing share of marketing worldwide. We demonstrate the existence of a novel form of market failure in this advertising market: influencer cartels, where groups of influencers collude to increase their advertising revenue by inflating their engagement. Our theoretical model shows that influencer cartels can improve consumer welfare if they expand social media engagement to the target audience, or reduce welfare if they divert engagement to less relevant audiences. We validate the model empirically using novel data on influencer cartels combined with machine learning tools, and derive policy implications for how to maximize consumer welfare.
    Date: 2024–05
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2405.10231&r=
  12. By: Denish Omondi Otieno; Faranak Abri; Sima Siami-Namini; Akbar Siami Namin
    Abstract: Large language models (LLMs) have attracted considerable attention for their ability to generate high-quality responses to human inputs. LLMs can compose not only textual scripts such as emails and essays but also executable programming code. In contrast, the automated reasoning capability of these LLMs in performing statistically-driven descriptive analysis, particularly on user-specific data, and their use as personal assistants for users with limited background knowledge in an application domain who would like to carry out basic as well as advanced statistical and domain-specific analysis, is not yet fully explored. More importantly, the performance of these LLMs has not been compared and discussed in detail for domain-specific data analysis tasks. This study therefore explores whether LLMs can serve as generative AI-based personal assistants that help users with minimal background knowledge in an application domain infer key data insights. To demonstrate the performance of the LLMs, the study reports a case study in which descriptive statistical analysis, as well as Natural Language Processing (NLP) based investigations, are performed on a number of phishing emails, with the objective of comparing the accuracy of the results generated by LLMs to those produced by analysts. The experimental results show that LangChain and the Generative Pre-trained Transformer (GPT-4) excel in numerical reasoning tasks, i.e., temporal statistical analysis, and achieve competitive correlation with human judgments on feature engineering tasks, while struggling to some extent with domain-specific knowledge reasoning, where domain-specific knowledge is required.
    Date: 2024–05
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2405.19578&r=
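    Sketch: the kind of analyst baseline the study compares LLMs against: descriptive temporal statistics and a term-frequency pass, here over a tiny invented set of phishing-email records.

      import pandas as pd
      from sklearn.feature_extraction.text import CountVectorizer

      emails = pd.DataFrame({
          "sent": pd.to_datetime(["2023-01-03", "2023-01-05", "2023-02-10"]),
          "body": ["verify your account now", "urgent: account suspended",
                   "claim your prize today"]})

      print(emails["sent"].dt.month.value_counts())     # temporal statistics
      vec = CountVectorizer(stop_words="english")
      counts = vec.fit_transform(emails["body"]).sum(axis=0).A1
      top = sorted(zip(counts, vec.get_feature_names_out()), reverse=True)[:5]
      print(top)                                        # most frequent terms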
  13. By: Bernhard Ganglmair; Julia Krämer; Jacopo Gambato
    Abstract: The EU General Data Protection Regulation (GDPR) of 2018 introduced stringent transparency rules compelling firms to disclose, in accessible language, details of their data collection, processing, and use. The specifics of the disclosure requirement are objective, and its compliance is easily verifiable; readability, however, is subjective and difficult to enforce. We use a simple inspection model to show how this asymmetric enforceability of regulatory rules and the corresponding firm compliance are linked. We then examine this link empirically using a large sample of privacy policies from German firms. We use text-as-data techniques to construct measures of disclosure and readability and show that firms increased the disclosure volume, but the readability of their privacy policies did not improve. Larger firms in concentrated industries demonstrated a stronger response in readability compliance, potentially due to heightened regulatory scrutiny. Moreover, data protection authorities with larger budgets induce better readability compliance without effects on disclosure.
    Keywords: data protection, disclosure, GDPR, privacy policies, readability, regulation, text-as-data, topic models
    JEL: C81 D23 K12 K20 L51 M15
    Date: 2024–05
    URL: http://d.repec.org/n?u=RePEc:bon:boncrc:crctr224_2024_547&r=
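    Sketch: two text-as-data measures of the kind the paper constructs, in simplified form: disclosure volume as a word count, and readability via the Flesch Reading Ease formula with a naive syllable counter (the authors' actual measures may differ).

      import re

      def syllables(word):
          """Crude vowel-group count; real work would use a proper estimator."""
          return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

      def flesch_reading_ease(text):
          sentences = max(1, len(re.findall(r"[.!?]+", text)))
          words = re.findall(r"[A-Za-z]+", text)
          syl = sum(syllables(w) for w in words)
          return 206.835 - 1.015 * len(words) / sentences - 84.6 * syl / len(words)

      policy = ("We process personal data to provide our services. "
                "Aggregated usage statistics may be shared with partners.")
      print("disclosure volume:", len(policy.split()))  # words in the policy
      print("readability:", round(flesch_reading_ease(policy), 1))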
  14. By: Greg Stoddard; Dylan J. Fitzpatrick; Jens Ludwig
    Abstract: Whether police misconduct can be prevented depends partly on whether it can be predicted. We show police misconduct is partially predictable and that estimated misconduct risk is not simply an artifact of measurement error or a proxy for officer activity. We also show many officers at risk of on-duty misconduct have elevated off-duty risk too, suggesting a potential link between accountability and officer wellness. We show that targeting preventive interventions even with a simple prediction model (the number of past complaints, which is less predictive than machine learning but cheaper to deploy) has a marginal value of public funds of infinity.
    JEL: C0 K0
    Date: 2024–05
    URL: http://d.repec.org/n?u=RePEc:nbr:nberwo:32432&r=
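    Sketch: the simple-model point above, in code: a one-feature logistic regression on past complaint counts ranks officers for preventive targeting. All data are simulated; the coefficients are not the paper's estimates.

      import numpy as np
      from sklearn.linear_model import LogisticRegression
      from sklearn.metrics import roc_auc_score

      rng = np.random.default_rng(5)
      n = 10_000
      complaints = rng.poisson(1.5, n).reshape(-1, 1)   # past complaint counts
      p = 1 / (1 + np.exp(-(-3.0 + 0.5 * complaints.ravel())))
      misconduct = rng.binomial(1, p)                   # future misconduct indicator

      model = LogisticRegression().fit(complaints, misconduct)
      risk = model.predict_proba(complaints)[:, 1]
      print("AUC:", round(roc_auc_score(misconduct, risk), 3))
      top_k = np.argsort(-risk)[:100]                   # officers to target first
      print("base rate vs. top-100 rate:",
            misconduct.mean().round(3), misconduct[top_k].mean().round(3))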
  15. By: Ariadna García-Prado (Universidad Pública de Navarra); Paula González (Universidad Pablo de Olavide); Yolanda F. Rebollo-Sanz (Universidad Pablo de Olavide)
    Abstract: A growing literature shows that the confinement policies used by governments to slow COVID-19 transmission have negative impacts on mental health, but the differential effects of individual policies on mental health remain poorly understood. We used data from the COVID-19 questionnaire of the Survey of Health, Ageing and Retirement in Europe (SHARE), which focuses on populations aged 50 and older, and the Oxford COVID-19 Government Response Tracker for 28 countries to estimate the effects of eight different confinement policies on three mental health outcomes: insomnia, anxiety and depression. We applied robust machine learning methods to estimate the effects of interest. Our results indicate that closure of schools and public transportation, restrictions on domestic and international travel, and gathering restrictions did not worsen the mental health of older populations in Europe. In contrast, stay-at-home policies and workplace closures aggravated the three health outcomes analyzed. Based on these findings, we close with a discussion of which policies should be implemented, intensified, or relaxed to control the spread of the virus without compromising the mental health of older populations.
    Keywords: COVID-19, mental health, confinement policies, older populations, Europe, robust machine learning methods.
    JEL: I18 I31
    Date: 2024
    URL: https://d.repec.org/n?u=RePEc:pab:wpaper:24.03&r=
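    Sketch: one plausible reading of "robust machine learning methods" (my choice of double/debiased ML with cross-fitting; the authors' estimator may differ): partial out confounders from both a policy indicator and a mental-health outcome with random forests, then regress residual on residual.

      import numpy as np
      from sklearn.ensemble import RandomForestRegressor
      from sklearn.model_selection import KFold

      rng = np.random.default_rng(6)
      n = 4000
      X = rng.normal(size=(n, 5))                       # age, health, country traits, ...
      policy = (X[:, 0] + rng.normal(size=n) > 0).astype(float)  # stay-at-home dummy
      anxiety = 0.4 * policy + X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)

      res_d, res_y = np.zeros(n), np.zeros(n)
      for train, test in KFold(5, shuffle=True, random_state=0).split(X):
          m_d = RandomForestRegressor(random_state=0).fit(X[train], policy[train])
          m_y = RandomForestRegressor(random_state=0).fit(X[train], anxiety[train])
          res_d[test] = policy[test] - m_d.predict(X[test])
          res_y[test] = anxiety[test] - m_y.predict(X[test])

      theta = (res_d @ res_y) / (res_d @ res_d)         # residual-on-residual slope
      print("estimated policy effect:", round(theta, 3))  # truth here is 0.4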

This nep-big issue is ©2024 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at https://nep.repec.org. For comments please write to the director of NEP, Marco Novarese at <director@nep.repec.org>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.