nep-big New Economics Papers
on Big Data
Issue of 2023‒09‒04
nineteen papers chosen by
Tom Coupé, University of Canterbury

  1. Should we trust web-scraped data? By Jens Foerderer
  2. ChatGPT-based Investment Portfolio Selection By Oleksandr Romanko; Akhilesh Narayan; Roy H. Kwon
  3. Amortized neural networks for agent-based model forecasting By Denis Koshelev; Alexey Ponomarenko; Sergei Seleznev
  4. Financial Fraud Detection: A Comparative Study of Quantum Machine Learning Models By Nouhaila Innan; Muhammad Al-Zafar Khan; Mohamed Bennai
  5. An investigation of auctions in the Regional Greenhouse Gas Initiative By Khezr, Peyman; Pourkhanali, Armin
  6. LOB-Based Deep Learning Models for Stock Price Trend Prediction: A Benchmark Study By Matteo Prata; Giuseppe Masi; Leonardo Berti; Viviana Arrigoni; Andrea Coletta; Irene Cannistraci; Svitlana Vyetrenko; Paola Velardi; Novella Bartolini
  7. EEGNN: edge enhanced graph neural network with a Bayesian nonparametric graph model By Liu, Yirui; Qiao, Xinghao; Wang, Liying; Lam, Jessica
  8. Brands And Chatbots: An Overview Using Machine Learning By Camilo R. Contreras; Pierre Valette-Florence
  9. Deep Policy Gradient Methods in Commodity Markets By Jonas Hanetho
  10. Sig-Splines: universal approximation and convex calibration of time series generative models By Magnus Wiese; Phillip Murray; Ralf Korn
  11. Evaluating Chatbot Performance: A Meta-Analysis Approach with Deep Learning By Jsowd, Kyldo
  12. Is rapid recovery always the best recovery? - Developing a machine learning approach for optimal assignment rules under capacity constraints for knee replacement patients By Cordier, J.; Salvi, I.; Steinbeck, V.; Geissler, A.; Vogel, J.
  13. Deep Reinforcement Learning for ESG financial portfolio management By Eduardo C. Garrido-Merchán; Sol Mora-Figueroa-Cruz-Guzmán; María Coronado-Vaca
  14. The words have power: the impact of news on exchange rates By Teona Shugliashvili
  15. Analysis of bank leverage via dynamical systems and deep neural networks By Lillo, Fabrizio; Livieri, Giulia; Marmi, Stefano; Solomko, Anton; Vaienti, Sandro
  16. A Topic Model for 10-K Management Disclosures By Fengler, Matthias; Phan, Minh Tri
  17. FinPT: Financial Risk Prediction with Profile Tuning on Pretrained Foundation Models By Yuwei Yin; Yazheng Yang; Jian Yang; Qi Liu
  18. A critical assessment of neural networks as meta-model of a farm optimization model By Seidel, Claudia; Shang, Linmei; Britz, Wolfgang
  19. Thailand Asset Value Estimation Using Aerial or Satellite Imagery By Supawich Puengdang; Worawate Ausawalaithong; Phiratath Nopratanawong; Narongdech Keeratipranon; Chayut Wongkamthong

  1. By: Jens Foerderer
    Abstract: The increasing adoption of econometric and machine-learning approaches by empirical researchers has led to a widespread use of one data collection method: web scraping. Web scraping refers to the use of automated computer programs to access websites and download their content. The key argument of this paper is that naïve web scraping procedures can lead to sampling bias in the collected data. This article describes three sources of sampling bias in web-scraped data. More specifically, sampling bias emerges from web content being volatile (i.e., being subject to change), personalized (i.e., presented in response to request characteristics), and unindexed (i.e., abundance of a population register). In a series of examples, I illustrate the prevalence and magnitude of sampling bias. To support researchers and reviewers, this paper provides recommendations on anticipating, detecting, and overcoming sampling bias in web-scraped data.
    Date: 2023–08
  2. By: Oleksandr Romanko; Akhilesh Narayan; Roy H. Kwon
    Abstract: In this paper, we explore potential uses of generative AI models, such as ChatGPT, for investment portfolio selection. Trusting investment advice from Generative Pre-Trained Transformer (GPT) models is a challenge due to model "hallucinations", necessitating careful verification and validation of the output. Therefore, we take an alternative approach. We use ChatGPT to obtain a universe of stocks from the S&P500 market index that are potentially attractive for investing. Subsequently, we compare various portfolio optimization strategies that utilize this AI-generated trading universe, evaluating them against quantitative portfolio optimization models as well as against some of the popular investment funds. Our findings indicate that ChatGPT is effective in stock selection but may not perform as well in assigning optimal weights to stocks within the portfolio. However, when stock selection by ChatGPT is combined with established portfolio optimization models, we achieve even better results. By blending the strengths of AI-generated stock selection with advanced quantitative optimization techniques, we observe the potential for more robust and favorable investment outcomes, suggesting a hybrid approach for more effective and reliable investment decision-making in the future.
    Date: 2023–08
  3. By: Denis Koshelev; Alexey Ponomarenko; Sergei Seleznev
    Abstract: In this paper, we propose a new procedure for unconditional and conditional forecasting in agent-based models. The proposed algorithm is based on the application of amortized neural networks and consists of two steps. The first step simulates artificial datasets from the model. In the second step, a neural network is trained to predict the future values of the variables using the history of observations. The main advantage of the proposed algorithm is its speed: after the training procedure, it can be used to yield predictions for almost any data without additional simulations or re-estimation of the neural network.
    Date: 2023–08
  4. By: Nouhaila Innan; Muhammad Al-Zafar Khan; Mohamed Bennai
    Abstract: In this research, a comparative study of four Quantum Machine Learning (QML) models was conducted for fraud detection in finance. We proved that the Quantum Support Vector Classifier model achieved the highest performance, with F1 scores of 0.98 for fraud and non-fraud classes. Other models like the Variational Quantum Classifier, Estimator Quantum Neural Network (QNN), and Sampler QNN demonstrate promising results, propelling the potential of QML classification for financial applications. While they exhibit certain limitations, the insights attained pave the way for future enhancements and optimisation strategies. However, challenges exist, including the need for more efficient Quantum algorithms and larger and more complex datasets. The article provides solutions to overcome current limitations and contributes new insights to the field of Quantum Machine Learning in fraud detection, with important implications for its future development.
    Date: 2023–08
  5. By: Khezr, Peyman; Pourkhanali, Armin
    Abstract: The Regional Greenhouse Gas Initiative (RGGI), as the largest cap-and-trade system in the United States, employs quarterly auctions to distribute emissions permits to firms. This study examines firm behavior and auction performance from both theoretical and empirical perspectives. We utilize auction theory to offer theoretical insights regarding the optimal bidding behavior of firms participating in these auctions. Subsequently, we analyze data from the past 58 RGGI auctions to assess the relevant parameters, employing panel random effects and machine learning models. Our findings indicate that most significant policy changes within RGGI, such as the Cost Containment Reserve, positively impacted the auction clearing price. Furthermore, we identify critical parameters, including the number of bidders and the extent of their demand in the auction, demonstrating their influence on the auction clearing price. This paper presents valuable policy insights for all cap-and-trade systems that allocate permits through auctions, as we employ data from an established market to substantiate the efficacy of policies and the importance of specific parameters.
    Keywords: Emissions permit, auctions, uniform-price, RGGI
    JEL: C5 D21 Q5
    Date: 2023–04–24
  6. By: Matteo Prata; Giuseppe Masi; Leonardo Berti; Viviana Arrigoni; Andrea Coletta; Irene Cannistraci; Svitlana Vyetrenko; Paola Velardi; Novella Bartolini
    Abstract: The recent advancements in Deep Learning (DL) research have notably influenced the finance sector. We examine the robustness and generalizability of fifteen state-of-the-art DL models focusing on Stock Price Trend Prediction (SPTP) based on Limit Order Book (LOB) data. To carry out this study, we developed LOBCAST, an open-source framework that incorporates data preprocessing, DL model training, evaluation and profit analysis. Our extensive experiments reveal that all models exhibit a significant performance drop when exposed to new data, thereby raising questions about their real-world market applicability. Our work serves as a benchmark, illuminating the potential and the limitations of current approaches and providing insight for innovative solutions.
    Date: 2023–07
  7. By: Liu, Yirui; Qiao, Xinghao; Wang, Liying; Lam, Jessica
    Abstract: Training deep graph neural networks (GNNs) poses a challenging task, as the performance of GNNs may suffer from the number of hidden message-passing layers. The literature has focused on the proposals of over-smoothing and under-reaching to explain the performance deterioration of deep GNNs. In this paper, we propose a new explanation for such deteriorated performance phenomenon, mis-simplification, that is, mistakenly simplifying graphs by preventing self-loops and forcing edges to be unweighted. We show that such simplifying can reduce the potential of message-passing layers to capture the structural information of graphs. In view of this, we propose a new framework, edge enhanced graph neural network (EEGNN). EEGNN uses the structural information extracted from the proposed Dirichlet mixture Poisson graph model (DMPGM), a Bayesian nonparametric model for graphs, to improve the performance of various deep message-passing GNNs. We propose a Markov chain Monte Carlo inference framework for DMPGM. Experiments over different datasets show that our method achieves considerable performance increase compared to baselines.
    JEL: C1
    Date: 2023
  8. By: Camilo R. Contreras (UGA INP IAE - Grenoble Institut d'Administration des Entreprises - UGA - Université Grenoble Alpes - Grenoble INP - Institut polytechnique de Grenoble - Grenoble Institute of Technology - UGA - Université Grenoble Alpes); Pierre Valette-Florence (UGA INP IAE - Grenoble Institut d'Administration des Entreprises - UGA - Université Grenoble Alpes - Grenoble INP - Institut polytechnique de Grenoble - Grenoble Institute of Technology - UGA - Université Grenoble Alpes)
    Abstract: As artificial intelligence (AI) and machine learning techniques have evolved to improve Natural Language Processing, human language understanding has enabled human-machine communication tools to be increasingly deployed by brands. Conversational agents or chatbots are among the most widely positioned in recent years of technological evolution, with unprecedented social skills. They have become a cornerstone for supporting brands' interactions with consumers in both digital and physical spaces. Due to the chatbots' massive scientific boom and the relevance they are gaining for brand management, practitioners and scholars show a growing interest in understanding the epistemological map in which this topic is embedded. To discover pragmatically the main cross-cutting issues and the current and emerging research topics, this study proposes applying Machine Learning techniques to the body of scientific production in this fruitful branch of marketing. Our instruments are twofold. First, we apply Latent Dirichlet Allocation (LDA) to identify eight thematic groups. Second, Dynamic Topic Models (DTM) reveal that the current research streams are oriented to technological advancement. In addition, research on chatbots and brand management is also emerging in two possible directions.
    Keywords: Brand Management, Conversational Agents, Literature Review, Machine Learning
    Date: 2021–11–30
  9. By: Jonas Hanetho
    Abstract: The energy transition has increased the reliance on intermittent energy sources, destabilizing energy markets and causing unprecedented volatility, culminating in the global energy crisis of 2021. In addition to harming producers and consumers, volatile energy markets may jeopardize vital decarbonization efforts. Traders play an important role in stabilizing markets by providing liquidity and reducing volatility. Several mathematical and statistical models have been proposed for forecasting future returns. However, developing such models is non-trivial due to financial markets' low signal-to-noise ratios and nonstationary dynamics. This thesis investigates the effectiveness of deep reinforcement learning methods in commodities trading. It formalizes the commodities trading problem as a continuing discrete-time stochastic dynamical system. This system employs a novel time-discretization scheme that is reactive and adaptive to market volatility, providing better statistical properties for the sub-sampled financial time series. Two policy gradient algorithms, an actor-based and an actor-critic-based, are proposed for optimizing a transaction-cost- and risk-sensitive trading agent. The agent maps historical price observations to market positions through parametric function approximators utilizing deep neural network architectures, specifically CNNs and LSTMs. On average, the deep reinforcement learning models produce an 83 percent higher Sharpe ratio than the buy-and-hold baseline when backtested on front-month natural gas futures from 2017 to 2022. The backtests demonstrate that the risk tolerance of the deep reinforcement learning agents can be adjusted using a risk-sensitivity term. The actor-based policy gradient algorithm performs significantly better than the actor-critic-based algorithm, and the CNN-based models perform slightly better than those based on the LSTM.
    Date: 2023–06
  10. By: Magnus Wiese; Phillip Murray; Ralf Korn
    Abstract: We propose a novel generative model for multivariate discrete-time time series data. Drawing inspiration from the construction of neural spline flows, our algorithm incorporates linear transformations and the signature transform as a seamless substitution for traditional neural networks. This approach enables us to achieve not only the universality property inherent in neural networks but also introduces convexity in the model's parameters.
    Date: 2023–07
  11. By: Jsowd, Kyldo
    Abstract: Chatbot technology has gained significant attention in recent years, with numerous studies focusing on developing and evaluating chatbot performance. However, due to the vast amount of research and the diversity of methodologies employed, it can be challenging to gain a comprehensive understanding of chatbot performance across different domains and applications. In this paper, we propose a meta-analysis approach to evaluate chatbot performance using deep learning techniques. The objective of this study is to systematically analyze and synthesize the findings from existing chatbot performance evaluations, providing a comprehensive assessment of chatbot capabilities and identifying factors that contribute to their success or limitations. To achieve this, we leverage deep learning models to extract valuable insights from a wide range of chatbot evaluation studies.
    Date: 2023–07–16
  12. By: Cordier, J.; Salvi, I.; Steinbeck, V.; Geissler, A.; Vogel, J.
    Abstract: Recent research suggests that rapid recovery after knee replacement is beneficial for all patients. Rapid recovery requires timely attention after surgery, yet staff resources are usually limited. Thus, patients with the highest possible health gains from rapid recovery should be identified with the objective to prioritise these patients when assigning rapid recovery capacities. We analyze the effect of optimal assignment rules under different capacity constraints for patients set on the rapid recovery care path using disease-specific patient-reported outcomes (KOOS-PS) as a measure of effectiveness. Subsequently, we build a policy tree to develop optimal treatment assignment rules. We use patient-reported and observational data from nine German hospitals from 2020/21. We apply a causal forest to estimate the double-robust treatment effects, controlling for patient characteristics. We confirm that on average, after controlling for patient characteristics, patients on the rapid recovery care path experience a significantly larger improvement of their joint functionality than patients on the conventional care path. Using the policy tree, we find that health outcome improvement can be increased on average from 17.87 (observed improvement) to 20.02 on the KOOS-PS scale (0-100) without increasing capacity, using optimal assignment rules that select patients for rapid recovery with characteristics linked to higher health gains. Increasing the capacity is expected to yield a health outcome improvement of 20.13. We conclude that novel machine learning methods are effective in developing rules for selecting patients for rapid recovery based on their characteristics, maximising overall health gains given limited resources. Ultimately, such algorithms should be used for clinical decision making systems as well as surgery and post-surgery capacity planning to work towards the pressing challenges of increasing demand and decreasing supply, driven by demographic change, in today’s hospital sector.
    Date: 2023–08
  13. By: Eduardo C. Garrido-Merchán; Sol Mora-Figueroa-Cruz-Guzmán; María Coronado-Vaca
    Abstract: This paper investigates the application of Deep Reinforcement Learning (DRL) for Environment, Social, and Governance (ESG) financial portfolio management, with a specific focus on the potential benefits of ESG score-based market regulation. We leveraged an Advantage Actor-Critic (A2C) agent and conducted our experiments using environments encoded within the OpenAI Gym, adapted from the FinRL platform. The study includes a comparative analysis of DRL agent performance under standard Dow Jones Industrial Average (DJIA) market conditions and a scenario where returns are regulated in line with company ESG scores. In the ESG-regulated market, grants were proportionally allotted to portfolios based on their returns and ESG scores, while taxes were assigned to portfolios below the mean ESG score of the index. The results intriguingly reveal that the DRL agent within the ESG-regulated market outperforms the standard DJIA market setup. Furthermore, we considered the inclusion of ESG variables in the agent state space, and compared this with scenarios where such data were excluded. This comparison adds to the understanding of the role of ESG factors in portfolio management decision-making. We also analyze the behaviour of the DRL agent in IBEX 35 and NASDAQ-100 indexes. Both the A2C and Proximal Policy Optimization (PPO) algorithms were applied to these additional markets, providing a broader perspective on the generalization of our findings. This work contributes to the evolving field of ESG investing, suggesting that market regulation based on ESG scoring can potentially improve DRL-based portfolio management, with significant implications for sustainable investing strategies.
    Date: 2023–06
  14. By: Teona Shugliashvili
    Abstract: Using the big data of news texts and a novel, news-extended exchange rate model, we investigate the impact of media news on major exchange rates. To present the impact of the U.S. Dollar related news on EUR/USD and GBP/USD, we first use a machine learning model and detect which news topics relate to the U.S. Dollar. Next, we calculate the attention to the U.S. Dollar related news topics over time. Finally, we visualize how exchange rates react to shocks in the attention to the U.S. Dollar related news topics. The impulse response functions of U.S. Dollar bilateral rates show that exchange rates respond to the U.S. Dollar related news and to the economic uncertainty news shocks with statistical significance in several periods after the shock. Forecast error decomposition documents that 25-27% of exchange rate variation in the long run comes from the news. The results reveal that news adds valuable information to macroeconomic fundamentals for identifying exchange rates, and that exchange rates are better identified when both macroeconomic and news information are used together. These findings are important for exchange rate modeling.
    Keywords: Foreign Exchange, News, Taylor rules, Text mining, LDA, Natural Language Processing (NLP)
    JEL: C55 D80 D84 F31 G14
    Date: 2023–06–02
  15. By: Lillo, Fabrizio; Livieri, Giulia; Marmi, Stefano; Solomko, Anton; Vaienti, Sandro
    Abstract: We consider a model of a simple financial system consisting of a leveraged investor that invests in a risky asset and manages risk by using value-at-risk (VaR). The VaR is estimated by using past data via an adaptive expectation scheme. We show that the leverage dynamics can be described by a dynamical system of slow-fast type associated with a unimodal map on [0, 1] with an additive heteroscedastic noise whose variance is related to the portfolio rebalancing frequency to target leverage. In absence of noise the model is purely deterministic and the parameter space splits into two regions: (i) a region with a globally attracting fixed point or a 2-cycle; (ii) a dynamical core region, where the map could exhibit chaotic behavior. Whenever the model is randomly perturbed, we prove the existence of a unique stationary density with bounded variation, the stochastic stability of the process, and the almost certain existence and continuity of the Lyapunov exponent for the stationary measure. We then use deep neural networks to estimate map parameters from a short time series. Using this method, we estimate the model in a large dataset of US commercial banks over the period 2001-2014. We find that the parameters of a substantial fraction of banks lie in the dynamical core, and their leverage time series are consistent with a chaotic behavior. We also present evidence that the time series of the leverage of large banks tend to exhibit chaoticity more frequently than those of small banks.
    Keywords: leverage cycles; Lyapunov exponents; neural networks; random dynamical systems; risk management; systemic risk; unimodal maps
    JEL: F3 G3 C1
    Date: 2023
  16. By: Fengler, Matthias; Phan, Minh Tri
    Abstract: We investigate the topics discussed in the Management's Discussion and Analysis (MD&A) section of 10-K filings from January 1994 to December 2018. In our modeling approach, we elicit the MD&A topics by clustering words around a set of anchor words that broadly define a potential topic. From the topics, we extract two hidden loading series from the MD&As - a measure of topic prevalence and a measure of topic sentiment. The results are three-fold. First, the topics we find are intelligible and distinctive but are potentially multi-modal, which may explain why classical topic models applied to 10-K filings often lack interpretability. Second, topic prevalence and sentiment tend to follow trends which, by and large, can be rationalized historically. Third, sentiment affects topics heterogeneously, i.e., in topic-specific ways. Adding to the extant document-level techniques, our study demonstrates the potential benefits of using a nuanced topic-level approach to analyze the MD&A.
    Keywords: 10-K files, MD&A, natural language processing, topic modeling
    JEL: C55 G30 M41
    Date: 2023–08
  17. By: Yuwei Yin; Yazheng Yang; Jian Yang; Qi Liu
    Abstract: Financial risk prediction plays a crucial role in the financial sector. Machine learning methods have been widely applied for automatically detecting potential risks and thus saving the cost of labor. However, development in this field has lagged behind in recent years for the following two reasons: 1) the algorithms used are somewhat outdated, especially in the context of the fast advance of generative AI and large language models (LLMs); 2) the lack of a unified and open-sourced financial benchmark has impeded the related research for years. To tackle these issues, we propose FinPT and FinBench: the former is a novel approach for financial risk prediction that conducts Profile Tuning on large pretrained foundation models, and the latter is a set of high-quality datasets on financial risks such as default, fraud, and churn. In FinPT, we fill the financial tabular data into the pre-defined instruction template, obtain natural-language customer profiles by prompting LLMs, and fine-tune large foundation models with the profile text to make predictions. We demonstrate the effectiveness of the proposed FinPT by experimenting with a range of representative strong baselines on FinBench. The analytical studies further deepen the understanding of LLMs for financial risk prediction.
    Date: 2023–07
  18. By: Seidel, Claudia; Shang, Linmei; Britz, Wolfgang
    Abstract: Mixed Integer programming (MIP) is frequently used in agricultural economics to solve farm-level optimization problems, but it can be computationally intensive especially when the number of binary or integer variables becomes large. In order to speed up simulations, for instance for large-scale sensitivity analysis or application to larger farm populations, meta-models can be derived from the original MIP and applied as an approximator instead. To test and assess this approach, we train Artificial Neural Networks (ANNs) as a meta-model of a farm-scale MIP model. This study compares different ANNs from various perspectives to assess to what extent they are able to replace the original MIP model. Results show that ANNs are promising for meta-modeling as they are computationally efficient and can handle non-linear relationships, corner solutions, and jumpy behavior of the underlying farm optimization model.
    Keywords: Agricultural and Food Policy, Farm Management, Research Methods/ Statistical Methods
    Date: 2023–08–22
  19. By: Supawich Puengdang; Worawate Ausawalaithong; Phiratath Nopratanawong; Narongdech Keeratipranon; Chayut Wongkamthong
    Abstract: Real estate is a critical sector in Thailand's economy, which has led to a growing demand for a more accurate land price prediction approach. Traditional methods of land price prediction, such as the weighted quality score (WQS), are limited due to their reliance on subjective criteria and their lack of consideration for spatial variables. In this study, we utilize aerial or satellite imageries from Google Map API to enhance land price prediction models from the dataset provided by Kasikorn Business Technology Group (KBTG). We propose a similarity-based asset valuation model that uses a Siamese-inspired Neural Network with pretrained EfficientNet architecture to assess the similarity between pairs of lands. By ensembling deep learning and tree-based models, we achieve an area under the ROC curve (AUC) of approximately 0.81, outperforming the baseline model that used only tabular data. The appraisal prices of nearby lands with similarity scores higher than a predefined threshold were used for weighted averaging to predict the reasonable price of the land in question. At 20% mean absolute percentage error (MAPE), we improve the recall from 59.26% to 69.55%, indicating a more accurate and reliable approach to predicting land prices. Our model, which is empowered by a more comprehensive view of land use and environmental factors from aerial or satellite imageries, provides a more precise, data-driven, and adaptive approach for land valuation in Thailand.
    Date: 2023–05
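
Illustrative note on paper 19: the abstract describes predicting a land price as a weighted average of the appraisal prices of nearby lands whose similarity score exceeds a predefined threshold. The sketch below shows one plausible reading of that step only. The function name, the use of the similarity score itself as the weight, and the default threshold of 0.5 are assumptions made for illustration; the abstract does not specify them, and the paper's actual weighting may differ.

```python
from typing import Optional, Sequence


def similarity_weighted_price(
    prices: Sequence[float],
    similarities: Sequence[float],
    threshold: float = 0.5,  # hypothetical value; the abstract gives none
) -> Optional[float]:
    """Average neighbour prices, weighted by similarity score.

    Only neighbours whose similarity is strictly above `threshold`
    contribute. Returns None when no neighbour qualifies.
    Assumption: weights are the raw similarity scores.
    """
    if len(prices) != len(similarities):
        raise ValueError("prices and similarities must have equal length")

    # Keep only sufficiently similar neighbours.
    kept = [(p, s) for p, s in zip(prices, similarities) if s > threshold]
    if not kept:
        return None

    total_weight = sum(s for _, s in kept)
    return sum(p * s for p, s in kept) / total_weight


if __name__ == "__main__":
    # Three nearby parcels; the third falls below the threshold and is ignored.
    estimate = similarity_weighted_price(
        prices=[100.0, 200.0, 300.0],
        similarities=[0.9, 0.6, 0.2],
    )
    print(estimate)
```

Returning None when no neighbour passes the threshold keeps the "no comparable land" case explicit, so a caller can fall back to another valuation method.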

This nep-big issue is ©2023 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at For comments please write to the director of NEP, Marco Novarese at <>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.