nep-big New Economics Papers
on Big Data
Issue of 2024‒02‒26
twenty papers chosen by
Tom Coupé, University of Canterbury


  1. Satellites turn “concrete”: tracking cement with satellite data and neural networks By d’Aspremont, Alexandre; Arous, Simon Ben; Bricongne, Jean-Charles; Lietti, Benjamin; Meunier, Baptiste
  2. How to use machine learning in finance By Mestiri, Sami
  3. BioFinBERT: Finetuning Large Language Models (LLMs) to Analyze Sentiment of Press Releases and Financial Text Around Inflection Points of Biotech Stocks By Valentina Aparicio; Daniel Gordon; Sebastian G. Huayamares; Yuhuai Luo
  4. Evaluating the Determinants of Mode Choice Using Statistical and Machine Learning Techniques in the Indian Megacity of Bengaluru By Tanmay Ghosh; Nithin Nagaraj
  5. Transformer-based approach for Ethereum Price Prediction Using Crosscurrency correlation and Sentiment Analysis By Shubham Singh; Mayur Bhat
  6. Cross-Domain Behavioral Credit Modeling: transferability from private to central data By O. Didkovskyi; N. Jean; G. Le Pera; C. Nordio
  7. MTRGL: Effective Temporal Correlation Discerning through Multi-modal Temporal Relational Graph Learning By Junwei Su; Shan Wu; Jinhui Li
  8. Large (and Deep) Factor Models By Bryan T. Kelly; Boris Kuznetsov; Semyon Malamud; Teng Andrea Xu
  9. From Numbers to Words: Multi-Modal Bankruptcy Prediction Using the ECL Dataset By Henri Arno; Klaas Mulier; Joke Baeck; Thomas Demeester
  10. Learning the Market: Sentiment-Based Ensemble Trading Agents By Andrew Ye; James Xu; Yi Wang; Yifan Yu; Daniel Yan; Ryan Chen; Bosheng Dong; Vipin Chaudhary; Shuai Xu
  11. Causal Machine Learning for Moderation Effects By Nora Bearth; Michael Lechner
  12. Data-driven Option Pricing By Min Dai; Hanqing Jin; Xi Yang
  13. Bank Business Models, Size, and Profitability By F. Bolivar; M. A. Duran; A. Lozano-Vivas
  14. How to Use Data Science in Economics -- a Classroom Game Based on Cartel Detection By Hannes Wallimann; Silvio Sticher
  15. Deep Generative Modeling for Financial Time Series with Application in VaR: A Comparative Review By Lars Ericson; Xuejun Zhu; Xusi Han; Rao Fu; Shuang Li; Steve Guo; Ping Hu
  16. Using Satellite Imagery to Detect the Impacts of New Highways: An Application to India By Kathryn Baragwanath Vogel; Gordon H. Hanson; Amit Khandelwal; Chen Liu; Hogeun Park
  17. Renoncer à la liberté : Comprendre les choix des détenus en matière de libération conditionnelle [Giving up freedom: Understanding inmates' parole choices] By Guy Lacroix; William Arbour
  18. A Novel Decision Ensemble Framework: Customized Attention-BiLSTM and XGBoost for Speculative Stock Price Forecasting By Riaz Ud Din; Salman Ahmed; Saddam Hussain Khan
  19. Neural Hawkes: Non-Parametric Estimation in High Dimension and Causality Analysis in Cryptocurrency Markets By Timothée Fabre; Ioane Muni Toke
  20. The Arrival of Fast Internet and Employment in Africa: Comment By David Roodman

  1. By: d’Aspremont, Alexandre; Arous, Simon Ben; Bricongne, Jean-Charles; Lietti, Benjamin; Meunier, Baptiste
    Abstract: This paper exploits daily infrared images taken from satellites to track economic activity in advanced and emerging countries. We first develop a framework to read, clean, and exploit satellite images. Our algorithm uses the laws of physics (Planck’s law) and machine learning to detect the heat produced by active cement plants. This allows us to monitor in real time whether a cement plant is working. Using this information on around 500 plants, we construct a satellite-based index tracking activity. We show that this satellite index outperforms benchmark models and alternative indicators for nowcasting the production of the cement industry as well as activity in the construction sector. Comparing across methods, we find that neural networks yield significantly more accurate predictions, as they exploit the granularity of our daily, plant-level data. Overall, we show that combining satellite images and machine learning allows economic activity to be tracked accurately.
    JEL: C51 C81 E23 E37
    Keywords: big data, construction, data science, high-frequency data, machine learning
    Date: 2024–01
    URL: http://d.repec.org/n?u=RePEc:ecb:ecbwps:20242900&r=big
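The authors' detection pipeline is not published with the abstract; as a rough, stdlib-only sketch of the physics step it names — inverting Planck's law to turn observed infrared radiance into a brightness temperature and flagging hot plants — the 11 µm band and the 320 K activity threshold below are purely illustrative assumptions, not the paper's values:

```python
import math

# Physical constants (SI units)
H = 6.62607015e-34   # Planck constant, J*s
C = 2.99792458e8     # speed of light, m/s
K = 1.380649e-23     # Boltzmann constant, J/K

def planck_radiance(wavelength_m, temp_k):
    """Spectral radiance B(lambda, T) from Planck's law, W * sr^-1 * m^-3."""
    a = 2.0 * H * C**2 / wavelength_m**5
    b = H * C / (wavelength_m * K * temp_k)
    return a / (math.exp(b) - 1.0)

def brightness_temperature(wavelength_m, radiance):
    """Invert Planck's law: recover the temperature implied by an observed radiance."""
    a = 2.0 * H * C**2 / wavelength_m**5
    return H * C / (wavelength_m * K * math.log(1.0 + a / radiance))

def plant_is_active(radiance, wavelength_m=11e-6, threshold_k=320.0):
    """Flag a plant as active when its pixel is hotter than the threshold (illustrative)."""
    return brightness_temperature(wavelength_m, radiance) > threshold_k
```

The inversion is exact, so a radiance simulated at 350 K maps back to 350 K and would be flagged, while one at 280 K would not.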
  2. By: Mestiri, Sami
    Abstract: In recent years, the financial sector has seen an increase in the use of machine learning models in banking and insurance contexts, and advanced analytics teams in the financial community are implementing these models regularly. In this paper, I present the different machine learning techniques used, provide some suggestions on the choice of methods in financial applications, and refer the reader to the R packages that can be used to compute the machine learning methods.
    Keywords: Financial applications; Machine learning; R software.
    JEL: C45 G00
    Date: 2023–10
    URL: http://d.repec.org/n?u=RePEc:pra:mprapa:120045&r=big
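The paper points to R packages; as a language-neutral illustration of the simplest technique in this family — logistic regression for default prediction, fit by plain gradient descent — here is a self-contained sketch (the data, learning rate, and epoch count are toy assumptions):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=500):
    """Fit logistic regression by plain stochastic gradient descent (no regularization)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi                      # gradient of log-loss w.r.t. the logit
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def predict_default_prob(w, b, x):
    """Predicted probability of default for one borrower's feature vector."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
```

On a toy dataset where a single feature (say, a debt ratio) separates defaulters from non-defaulters, the fitted model assigns high default probability to high-ratio borrowers and low probability to low-ratio ones.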
  3. By: Valentina Aparicio; Daniel Gordon; Sebastian G. Huayamares; Yuhuai Luo
    Abstract: Large language models (LLMs) are deep learning algorithms used to perform natural language processing tasks in various fields, from the social sciences to finance and the biomedical sciences. Developing and training a new LLM can be very computationally expensive, so it is becoming common practice to take existing LLMs and finetune them with carefully curated datasets for desired applications. Here, we present BioFinBERT, a finetuned LLM that performs financial sentiment analysis of public text associated with the stocks of companies in the biotechnology sector. The stocks of biotech companies developing highly innovative and risky therapeutic drugs tend to respond very positively or negatively upon a successful or failed clinical readout or regulatory approval of their drug. These clinical or regulatory results are disclosed by the biotech companies via press releases, which in many cases are followed by a significant stock response. To design an LLM capable of analyzing the sentiment of these press releases, we first finetuned BioBERT, a biomedical language representation model designed for biomedical text mining, using financial textual databases. Our finetuned model, termed BioFinBERT, was then used to perform financial sentiment analysis of various biotech-related press releases and financial text around inflection points that significantly affected the price of biotech stocks.
    Date: 2024–01
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2401.11011&r=big
  4. By: Tanmay Ghosh; Nithin Nagaraj
    Abstract: The decision making behind mode choice is critical for transportation planning. While statistical techniques like discrete choice models have traditionally been used, machine learning (ML) models have recently gained traction among transportation planners due to their higher predictive performance. However, the black-box nature of ML models poses significant interpretability challenges, limiting their practical application in decision and policy making. This study utilised a dataset of 1350 households belonging to the low and low-middle income brackets in the city of Bengaluru to investigate mode choice decision making behaviour using a multinomial logit model and ML classifiers such as decision trees, random forests, extreme gradient boosting, and support vector machines. In terms of accuracy, the random forest model performed best (0.788 on training data and 0.605 on testing data) compared to all the other models. This research adopted modern interpretability techniques such as feature importance and individual conditional expectation plots to explain the decision making behaviour of the ML models. Higher travel costs significantly reduce the predicted probability of bus usage compared to other modes (a 0.66% and 0.34% reduction using the random forest and XGBoost models for a 10% increase in travel cost). However, reducing travel time by 10% increases the preference for the metro (0.16% in random forests and 0.42% in XGBoost). This research augments ongoing work on mode choice analysis using machine learning techniques, helping to improve the understanding of these models' performance on real-world data in terms of both accuracy and interpretability.
    Date: 2024–01
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2401.13977&r=big
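The multinomial logit baseline used in the study has a simple closed form for choice probabilities; a sketch with hypothetical utility coefficients (the betas and cost/time figures below are illustrative assumptions, not the paper's estimates) shows how a fare increase shifts predicted mode shares:

```python
import math

def mode_choice_probs(utilities):
    """Multinomial logit: P(mode i) = exp(V_i) / sum_j exp(V_j)."""
    m = max(utilities.values())                       # stabilize the exponentials
    exps = {k: math.exp(v - m) for k, v in utilities.items()}
    total = sum(exps.values())
    return {k: v / total for k, v in exps.items()}

def utility(cost, time, beta_cost=-0.05, beta_time=-0.03):
    """Hypothetical linear utility: both cost and time reduce attractiveness."""
    return beta_cost * cost + beta_time * time

base = {"bus": utility(10, 40), "metro": utility(15, 25), "walk": utility(0, 60)}
pricier_bus = dict(base, bus=utility(11, 40))         # a 10% bus-fare increase
```

Raising the bus fare lowers the predicted bus share and shifts probability mass toward the other modes, mirroring the direction of the paper's travel-cost finding.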
  5. By: Shubham Singh; Mayur Bhat
    Abstract: The research delves into the capabilities of a transformer-based neural network for Ethereum cryptocurrency price forecasting. The experiment is built around the hypothesis that cryptocurrency prices are strongly correlated with other cryptocurrencies and with the sentiment around the cryptocurrency. The model employs a transformer architecture across several setups, from single-feature scenarios to complex configurations incorporating volume, sentiment, and correlated cryptocurrency prices. Despite a smaller dataset and less complex architecture, the transformer model surpasses its ANN and MLP counterparts on some metrics. The conclusion presents a hypothesis on the illusion of causality in cryptocurrency price movements driven by sentiment.
    Date: 2024–01
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2401.08077&r=big
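The transformer's core operation is scaled dot-product attention; a dependency-free sketch of that single step (not the paper's full architecture, which stacks many such layers with learned projections):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention: each output is a softmax-weighted
    average of the value vectors, weighted by query-key similarity."""
    d_k = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out
```

With a query aligned to the first key, the output leans toward the first value vector while remaining a convex combination of all values.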
  6. By: O. Didkovskyi; N. Jean; G. Le Pera; C. Nordio
    Abstract: This paper introduces a credit risk rating model for credit risk assessment in quantitative finance, aiming to categorize borrowers based on their behavioral data. The model is trained on data from Experian, a widely recognized credit bureau, to effectively identify instances of loan defaults among bank customers. Employing state-of-the-art statistical and machine learning techniques ensures the model's predictive accuracy. Furthermore, we assess the model's transferability by testing it on behavioral data from the Bank of Italy, demonstrating its potential applicability across diverse datasets during prediction. This study highlights the benefits of incorporating external behavioral data to improve credit risk assessment in financial institutions.
    Date: 2024–01
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2401.09778&r=big
  7. By: Junwei Su; Shan Wu; Jinhui Li
    Abstract: In this study, we explore the synergy of deep learning and financial market applications, focusing on pair trading. This market-neutral strategy is integral to quantitative finance and is apt for advanced deep-learning techniques. A pivotal challenge in pair trading is discerning temporal correlations among entities, necessitating the integration of diverse data modalities. Addressing this, we introduce a novel framework, Multi-modal Temporal Relation Graph Learning (MTRGL). MTRGL combines time series data and discrete features into a temporal graph and employs a memory-based temporal graph neural network. This approach reframes temporal correlation identification as a temporal graph link prediction task, which has shown empirical success. Our experiments on real-world datasets confirm the superior performance of MTRGL, emphasizing its promise in refining automated pair trading strategies.
    Date: 2024–01
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2401.14199&r=big
  8. By: Bryan T. Kelly (Yale SOM; AQR Capital Management, LLC; National Bureau of Economic Research (NBER)); Boris Kuznetsov (Swiss Finance Institute; EPFL); Semyon Malamud (Ecole Polytechnique Federale de Lausanne; Centre for Economic Policy Research (CEPR); Swiss Finance Institute); Teng Andrea Xu (École Polytechnique Fédérale de Lausanne)
    Abstract: We open up the black box behind Deep Learning for portfolio optimization and prove that a sufficiently wide and arbitrarily deep neural network (DNN) trained to maximize the Sharpe ratio of the Stochastic Discount Factor (SDF) is equivalent to a large factor model (LFM): a linear factor pricing model that uses many non-linear characteristics. The nature of these characteristics depends on the architecture of the DNN in an explicit, tractable fashion. This makes it possible to derive end-to-end trained DNN-based SDFs in closed form for the first time. We evaluate LFMs empirically and show how various architectural choices impact SDF performance. We document the virtue of depth complexity: with enough data, the out-of-sample performance of the DNN-based SDF is increasing in network depth, saturating at depths of around 100 hidden layers.
    Date: 2023–12
    URL: http://d.repec.org/n?u=RePEc:chf:rpseri:rp23121&r=big
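In the linear-factor case the paper reduces to, the Sharpe-maximizing portfolio has the textbook closed form w ∝ Σ⁻¹μ; a two-factor sketch with toy numbers (normalizing weights to sum to one is a presentation choice here, not part of the result):

```python
def max_sharpe_weights(mu, sigma):
    """Closed-form max-Sharpe weights for 2 factors: w proportional to Sigma^-1 mu."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    w = [inv[0][0] * mu[0] + inv[0][1] * mu[1],
         inv[1][0] * mu[0] + inv[1][1] * mu[1]]
    s = sum(w)
    return [wi / s for wi in w]          # normalize to sum to one

def sharpe(w, mu, sigma):
    """Sharpe ratio of a two-factor portfolio (excess returns assumed)."""
    ret = w[0] * mu[0] + w[1] * mu[1]
    var = (w[0] * w[0] * sigma[0][0] + 2 * w[0] * w[1] * sigma[0][1]
           + w[1] * w[1] * sigma[1][1])
    return ret / var ** 0.5
```

With independent factors of equal variance, the optimal weights are proportional to expected returns, and the resulting Sharpe ratio weakly dominates any other weighting such as equal weights.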
  9. By: Henri Arno; Klaas Mulier; Joke Baeck; Thomas Demeester
    Abstract: In this paper, we present ECL, a novel multi-modal dataset containing the textual and numerical data from corporate 10K filings and associated binary bankruptcy labels. Furthermore, we develop and critically evaluate several classical and neural bankruptcy prediction models using this dataset. Our findings suggest that the information contained in each data modality is complementary for bankruptcy prediction. We also see that the binary bankruptcy prediction target does not enable our models to distinguish next-year bankruptcy from an unhealthy financial situation resulting in bankruptcy in later years. Finally, we explore the use of LLMs in the context of our task. We show how GPT-based models can be used to extract meaningful summaries from the textual data, but zero-shot bankruptcy prediction results are poor. All resources required to access and update the dataset or replicate our experiments are available on github.com/henriarnoUG/ECL.
    Date: 2024–01
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2401.12652&r=big
  10. By: Andrew Ye; James Xu; Yi Wang; Yifan Yu; Daniel Yan; Ryan Chen; Bosheng Dong; Vipin Chaudhary; Shuai Xu
    Abstract: We propose the integration of sentiment analysis and deep-reinforcement learning ensemble algorithms for stock trading, and design a strategy capable of dynamically altering its employed agent given concurrent market sentiment. In particular, we create a simple-yet-effective method for extracting news sentiment and combine this with general improvements upon existing works, resulting in automated trading agents that effectively consider both qualitative market factors and quantitative stock data. We show that our approach results in a strategy that is profitable, robust, and risk-minimal -- outperforming the traditional ensemble strategy as well as single agent algorithms and market metrics. Our findings determine that the conventional practice of switching ensemble agents every fixed number of months is sub-optimal, and that a dynamic sentiment-based framework greatly unlocks additional performance within these agents. Furthermore, as we have designed our algorithm with simplicity and efficiency in mind, we hypothesize that the transition of our method from historical evaluation towards real-time trading with live data should be relatively simple.
    Date: 2024–02
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2402.01441&r=big
  11. By: Nora Bearth; Michael Lechner
    Abstract: It is valuable for any decision maker to know the impact of decisions (treatments) on average and for subgroups. The causal machine learning literature has recently provided tools for estimating group average treatment effects (GATE) to understand treatment heterogeneity better. This paper addresses the challenge of interpreting such differences in treatment effects between groups while accounting for variations in other covariates. We propose a new parameter, the balanced group average treatment effect (BGATE), which measures a GATE with a specific distribution of a priori-determined covariates. By taking the difference of two BGATEs, we can analyse heterogeneity more meaningfully than by comparing two GATEs. The estimation strategy for this parameter is based on double/debiased machine learning for discrete treatments in an unconfoundedness setting, and the estimator is shown to be $\sqrt{N}$-consistent and asymptotically normal under standard conditions. Adding additional identifying assumptions allows specific balanced differences in treatment effects between groups to be interpreted causally, leading to the causal balanced group average treatment effect. We explore the finite sample properties in a small-scale simulation study and demonstrate the usefulness of these parameters in an empirical example.
    Date: 2024–01
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2401.08290&r=big
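Under random assignment, a GATE is just a within-group difference in mean outcomes; a toy sketch of that baseline (the paper's BGATE additionally balances covariates across groups, which this deliberately omits):

```python
def gate(data, group):
    """Group average treatment effect: mean(Y | T=1) - mean(Y | T=0) within a group.

    data: list of (group, treated, outcome) tuples. Valid as stated only under
    random assignment; with confounding, the paper's double/debiased machine
    learning machinery is needed.
    """
    treated = [y for g, t, y in data if g == group and t == 1]
    control = [y for g, t, y in data if g == group and t == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)
```

Comparing two GATEs then amounts to differencing them, which is exactly the quantity the BGATE makes interpretable by holding the distribution of other covariates fixed.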
  12. By: Min Dai; Hanqing Jin; Xi Yang
    Abstract: We propose an innovative data-driven option pricing methodology that relies exclusively on the dataset of historical underlying asset prices. While the dataset is rooted in the objective world, option prices are commonly expressed as discounted expectations of their terminal payoffs in a risk-neutral world. Bridging this gap motivates us to identify a pricing kernel process, transforming option pricing into evaluating expectations in the objective world. We recover the pricing kernel by solving a utility maximization problem, and evaluate the expectations in terms of a functional optimization problem. Leveraging the deep learning technique, we design data-driven algorithms to solve both optimization problems over the dataset. Numerical experiments are presented to demonstrate the efficiency of our methodology.
    Date: 2024–01
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2401.11158&r=big
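The bridge the paper exploits is that a price is the objective-measure expectation of the pricing kernel times the payoff; a one-period binomial sketch (all parameters toy) verifies that kernel pricing matches risk-neutral pricing:

```python
def kernel_price(payoffs, probs, kernel):
    """Price = E[m * payoff] under the objective (real-world) measure."""
    return sum(p * m * x for x, p, m in zip(payoffs, probs, kernel))

def one_period_binomial(s0, u, d, r, p_up, strike):
    """Price a call two ways in a one-period binomial model:
    via the pricing kernel and via the discounted risk-neutral expectation."""
    q_up = (1 + r - d) / (u - d)                       # risk-neutral up probability
    probs = [p_up, 1 - p_up]                           # objective probabilities
    kernel = [q_up / (p_up * (1 + r)),                 # state prices / probabilities
              (1 - q_up) / ((1 - p_up) * (1 + r))]
    payoffs = [max(s0 * u - strike, 0.0), max(s0 * d - strike, 0.0)]
    price_kernel = kernel_price(payoffs, probs, kernel)
    price_rn = (q_up * payoffs[0] + (1 - q_up) * payoffs[1]) / (1 + r)
    return price_kernel, price_rn
```

The two prices coincide by construction; the paper's contribution is recovering the kernel from historical data alone via utility maximization, which this sketch does not attempt.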
  13. By: F. Bolivar; M. A. Duran; A. Lozano-Vivas
    Abstract: To examine the relation between profitability and business models (BMs) across bank sizes, the paper proposes a research strategy based on machine learning techniques. This strategy allows for analyzing whether size and profit performance underlie BM heterogeneity, with BM identification being based on how the components of the bank portfolio contribute to profitability. The empirical exercise focuses on the European Union banking system. Our results suggest that banks with analogous levels of performance and different sizes share strategic features. Additionally, high capital ratios seem compatible with high profitability if banks, relative to their size peers, adopt a standard retail BM.
    Date: 2024–01
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2401.12323&r=big
  14. By: Hannes Wallimann; Silvio Sticher
    Abstract: We present a classroom game that integrates economics and data-science competencies. In the first two parts of the game, participants assume the roles of firms in a procurement market, where they must either adopt competitive behaviors or have the option to engage in collusion. Success in these parts hinges on their comprehension of market dynamics. In the third part of the game, participants transition to the role of competition-authority members. Drawing from recent literature on machine-learning-based cartel detection, they analyze the bids for patterns indicative of collusive (cartel) behavior. In this part of the game, success depends on data-science skills. We offer a detailed discussion on implementing the game, emphasizing considerations for accommodating differing levels of preexisting knowledge in data science.
    Date: 2024–01
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2401.14757&r=big
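A common screen in the machine-learning cartel-detection literature the game draws on is the coefficient of variation of bids, which tends to be abnormally low in collusive tenders; a sketch with an illustrative threshold (this is one screening feature, not the game's actual classifier):

```python
import statistics

def coefficient_of_variation(bids):
    """Dispersion screen: collusive tenders often show unusually tight bid clustering."""
    return statistics.pstdev(bids) / statistics.mean(bids)

def flag_tender(bids, threshold=0.05):
    """Flag a tender as suspicious when its bids cluster below the threshold (illustrative)."""
    return coefficient_of_variation(bids) < threshold
```

A tender with widely spread competitive bids passes the screen, while one with near-identical bids is flagged for closer inspection.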
  15. By: Lars Ericson; Xuejun Zhu; Xusi Han; Rao Fu; Shuang Li; Steve Guo; Ping Hu
    Abstract: In the financial services industry, forecasting the risk factor distribution conditional on the history and the current market environment is the key to market risk modeling in general and value-at-risk (VaR) models in particular. As one of the most widely adopted VaR models in commercial banks, historical simulation (HS) uses the empirical distribution of daily returns in a historical window as the forecast distribution of risk factor returns for the next day. The objectives for financial time series generation are to generate synthetic data paths with good variety and with distribution and dynamics similar to the original historical data. In this paper, we apply multiple existing deep generative methods (e.g., CGAN, CWGAN, Diffusion, and Signature WGAN) for conditional time series generation, and propose and test two new methods for conditional multi-step time series generation, namely Encoder-Decoder CGAN and Conditional TimeVAE. Furthermore, we introduce a comprehensive framework with a set of KPIs to measure the quality of the generated time series for financial modeling. The KPIs cover distribution distance, autocorrelation, and backtesting. All models (HS, parametric, and neural networks) are tested on both historical USD yield curve data and additional data simulated from GARCH and CIR processes. The study shows that the top-performing models are the HS, GARCH, and CWGAN models. Future research directions in this area are also discussed.
    Date: 2024–01
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2401.10370&r=big
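Historical simulation itself is nearly a one-liner — the VaR is an empirical quantile of past losses — which is why it serves as the benchmark the generative models are compared against. A sketch (the quantile-index convention below is one of several in use):

```python
def historical_var(returns, alpha=0.99):
    """Historical-simulation VaR: the empirical alpha-quantile of losses over a
    window of past returns, reported as a positive number."""
    losses = sorted(-r for r in returns)
    idx = min(int(round(alpha * len(losses))), len(losses) - 1)
    return losses[idx]
```

On a window of returns spread evenly from -5% to +4.9%, the 99% VaR picks out the worst observed loss and the 95% VaR a loss near the 95th percentile.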
  16. By: Kathryn Baragwanath Vogel; Gordon H. Hanson; Amit Khandelwal; Chen Liu; Hogeun Park
    Abstract: This paper integrates daytime and nighttime satellite imagery into a spatial general-equilibrium model to evaluate the returns to investments in new motorways. Our approach has particular value in developing-country settings in which granular data on economic activity are scarce. To demonstrate our method, we use multi-spectral imagery—publicly available across the globe—to evaluate India’s varied road construction projects in the early 2000s. Estimating the model requires only remotely-sensed data, while evaluating welfare impacts requires one year of population data, which are increasingly available through public sources. We find that India’s road investments from this period improved aggregate welfare, particularly for the largest and smallest urban markets. The analysis further reveals that most welfare gains accrued within Indian districts, demonstrating the potential benefits of using the high spatial resolution of satellite images.
    JEL: O1 R1
    Date: 2024–01
    URL: http://d.repec.org/n?u=RePEc:nbr:nberwo:32047&r=big
  17. By: Guy Lacroix; William Arbour
    Abstract: In Quebec, offenders sentenced to more than six months are eligible for parole once they have served one-third of their sentence, yet about half of eligible offenders choose to waive their right to attend a parole hearing. Why? A new CIRANO study (Lacroix et al., 2023) shows that for some offenders the decision to renounce is actually rational. The results also suggest that parole has a significant impact on social reintegration. The study is based on exclusive administrative data from the Quebec Ministry of Public Security covering more than ten years. It is the only study conducted in Quebec that draws robust conclusions through the application of advanced econometric methods and machine learning techniques.
    Keywords: Criminal recidivism, Parole, Multivariate regressions, Machine learning
    Date: 2024–02–08
    URL: http://d.repec.org/n?u=RePEc:cir:circah:2024pj-01&r=big
  18. By: Riaz Ud Din; Salman Ahmed; Saddam Hussain Khan
    Abstract: Forecasting speculative stock prices is essential for effective investment risk management and drives the need for innovative algorithms. However, the speculative nature, volatility, and complex sequential dependencies of financial markets present inherent challenges that necessitate advanced techniques. This paper proposes a novel framework, CAB-XDE (customized attention BiLSTM-XGB decision ensemble), for predicting the daily closing price of the speculative stock Bitcoin-USD (BTC-USD). The framework integrates a customized bi-directional long short-term memory (BiLSTM) network with an attention mechanism and the XGBoost algorithm. The customized BiLSTM captures complex sequential dependencies and speculative market trends, while the attention mechanism dynamically assigns weights to influential features, enhancing interpretability and improving cost and volatility forecasting. XGBoost handles nonlinear relationships and contributes to the framework's robustness. A weight determination theory-error reciprocal method further refines predictions by iteratively adjusting model weights based on discrepancies between theoretical expectations and the actual errors of the individual attention-BiLSTM and XGBoost models. Finally, the predictions of the two models are concatenated to obtain a diverse prediction space and passed to an ensemble classifier to improve the generalization capabilities of CAB-XDE. Validated empirically on the volatile Bitcoin market using data sourced from Yahoo Finance, CAB-XDE outperforms state-of-the-art models with a MAPE of 0.0037, MAE of 84.40, and RMSE of 106.14.
    Date: 2024–01
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2401.11621&r=big
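The error-reciprocal weighting step can be sketched independently of the BiLSTM and XGBoost components: each model is weighted by the inverse of its validation error, so the more accurate model dominates the combination (the error values and predictions below are toy assumptions):

```python
def reciprocal_error_weights(errors):
    """Weight each model by the reciprocal of its validation error,
    normalized so the weights sum to one."""
    inv = [1.0 / e for e in errors]
    total = sum(inv)
    return [i / total for i in inv]

def ensemble_predict(predictions, weights):
    """Weighted combination of the individual models' predictions."""
    return sum(p * w for p, w in zip(predictions, weights))
```

With toy errors of 0.02 for one model and 0.08 for the other, the first model receives four times the weight, pulling the combined forecast toward its prediction.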
  19. By: Timothée Fabre; Ioane Muni Toke
    Abstract: We propose a novel approach to marked Hawkes kernel inference which we name the moment-based neural Hawkes estimation method. Hawkes processes are fully characterized by their first and second order statistics through a Fredholm integral equation of the second kind. Using recent advances in solving partial differential equations with physics-informed neural networks, we provide a numerical procedure to solve this integral equation in high dimension. Together with an adapted training pipeline, we give a generic set of hyperparameters that produces robust results across a wide range of kernel shapes. We conduct an extensive numerical validation on simulated data. We finally propose two applications of the method to the analysis of the microstructure of cryptocurrency markets. In a first application we extract the influence of volume on the arrival rate of BTC-USD trades and in a second application we analyze the causality relationships and their directions amongst a universe of 15 cryptocurrency pairs in a centralized exchange.
    Date: 2024–01
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2401.09361&r=big
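The paper estimates Hawkes kernels non-parametrically; for intuition, here is the parametric special case with an exponential kernel, where the conditional intensity is a baseline rate plus decaying excitation bumps from past events (the parameter values are illustrative, chosen so that alpha/beta < 1 keeps the process stable):

```python
import math

def hawkes_intensity(t, events, mu=0.5, alpha=0.8, beta=1.5):
    """Conditional intensity of a univariate Hawkes process with an exponential
    kernel: lambda(t) = mu + sum over past events of alpha * exp(-beta * (t - t_i))."""
    return mu + sum(alpha * math.exp(-beta * (t - ti))
                    for ti in events if ti < t)
```

Before any event the intensity sits at the baseline mu; just after a cluster of events it spikes, then decays back toward mu — the self-excitation that makes Hawkes processes natural models for trade arrivals.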
  20. By: David Roodman
    Abstract: Hjort and Poulsen (2019) frames the staggered arrival of submarine Internet cables on the shores of Africa circa 2010 as a difference-in-differences natural experiment in broadband access. The paper finds positive impacts on individual- and firm-level employment and nighttime light emissions. These results are largely ascribable to geocoding errors; to discontinuities from a satellite changeover at end-2009; and to a definition of the treated zone that has unclear technological basis, is narrower than the spatial resolution of nearly all the data sources, and is weakly representative of the geography of broadband availability.
    Date: 2024–01
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2401.13694&r=big
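The design under scrutiny is a difference-in-differences; the canonical 2x2 estimator it builds on is only a few lines (the data below are toy):

```python
def did_estimate(pre_treat, post_treat, pre_ctrl, post_ctrl):
    """Canonical 2x2 difference-in-differences:
    (treated post - treated pre) - (control post - control pre)."""
    mean = lambda xs: sum(xs) / len(xs)
    return (mean(post_treat) - mean(pre_treat)) - (mean(post_ctrl) - mean(pre_ctrl))
```

The comment's point is that this estimator is only as good as its inputs: geocoding errors, a mid-sample satellite changeover, and an ill-defined treated zone all contaminate which observations land in the "treated post" cell.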

This nep-big issue is ©2024 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at https://nep.repec.org. For comments please write to the director of NEP, Marco Novarese at <director@nep.repec.org>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.