nep-big New Economics Papers
on Big Data
Issue of 2024‒06‒10
nineteen papers chosen by
Tom Coupé, University of Canterbury


  1. The Effect of Data Types on the Performance of Machine Learning Algorithms for Financial Prediction By Hulusi Mehmet Tanrikulu; Hakan Pabuccu
  2. Mathematics of Differential Machine Learning in Derivative Pricing and Hedging By Pedro Duarte Gomes
  3. Deep learning solutions of DSGE models: A technical report By Pierre Beck; Pablo Garcia-Sanchez; Alban Moura; Julien Pascal; Olivier Pierrard
  4. Interpretable Machine Learning Models for Predicting the Next Targets of Activist Funds By Minwu Kim
  5. Targeted aspect-based emotion analysis to detect opportunities and precaution in financial Twitter messages By Silvia García-Méndez; Francisco de Arriba-Pérez; Ana Barros-Vila; Francisco J. González-Castaño
  6. Application and practice of AI technology in quantitative investment By Shuochen Bi; Wenqing Bao; Jue Xiao; Jiangshan Wang; Tingting Deng
  7. Application of Deep Learning for Factor Timing in Asset Management By Prabhu Prasad Panda; Maysam Khodayari Gharanchaei; Xilin Chen; Haoshu Lyu
  8. ECC Analyzer: Extract Trading Signal from Earnings Conference Calls using Large Language Model for Stock Performance Prediction By Yupeng Cao; Zhi Chen; Qingyun Pei; Prashant Kumar; K. P. Subbalakshmi; Papa Momar Ndiaye
  9. Innovative Application of Artificial Intelligence Technology in Bank Credit Risk Management By Shuochen Bi; Wenqing Bao
  10. How do applied researchers use the Causal Forest? A methodological review of a method By Patrick Rehill
  11. DAM: A Universal Dual Attention Mechanism for Multimodal Timeseries Cryptocurrency Trend Forecasting By Yihang Fu; Mingyu Zhou; Luyao Zhang
  12. Exploring Recent Ideological Divides in Turkey: Political and Cultural Axes By Kina, Mehmet Fuat
  13. NumLLM: Numeric-Sensitive Large Language Model for Chinese Finance By Huan-Yi Su; Ke Wu; Yu-Hao Huang; Wu-Jun Li
  14. A Comparison of Traditional and Deep Learning Methods for Parameter Estimation of the Ornstein-Uhlenbeck Process By Jacob Fein-Ashley
  15. The Shape of Money Laundering: Subgraph Representation Learning on the Blockchain with the Elliptic2 Dataset By Claudio Bellei; Muhua Xu; Ross Phillips; Tom Robinson; Mark Weber; Tim Kaler; Charles E. Leiserson; Arvind; Jie Chen
  16. Nowcasting the growth rate of the ICT sector By OECD
  17. Ordered Correlation Forest By Riccardo Di Francesco
  18. Algorithmic Bias and Racial Inequality: A Critical Review By Kasy, Maximilian
  19. Occupational reallocation and mismatch in the wake of the Covid-19 pandemic: Cross-country evidence from an online job site By Gabriele Ciminelli; Antton Haramboure; Lea Samek; Cyrille Schwellnus; Allison Shrivastava; Tara Sinclair

  1. By: Hulusi Mehmet Tanrikulu; Hakan Pabuccu
    Abstract: Forecasting cryptocurrency prices is a crucial financial problem, as it offers investors potential financial benefits. Even a small improvement in forecasting performance can increase profitability, so obtaining realistic forecasts matters greatly to investors. Successful forecasting provides traders with effective buy-or-hold strategies, allowing them to make more profit. The most important requirement in this process is to produce accurate forecasts suitable for real-life applications. Bitcoin, frequently mentioned recently for its volatility and chaotic behaviour, has begun to attract great attention and has become a popular investment tool, especially during and after the COVID-19 pandemic. This study provides a comprehensive methodology that constructs continuous and trend data, using one- and seven-year periods of data as inputs, and applies machine learning (ML) algorithms to forecast Bitcoin price movements. A binarization procedure was applied to the continuous data to construct trend data representing the trend of each input feature. Following the related literature, the input features were chosen as technical indicators, Google Trends, and the number of tweets. Random Forest (RF), K-Nearest Neighbours (KNN), Extreme Gradient Boosting (XGBoost), Support Vector Machine (SVM), Naive Bayes (NB), Artificial Neural Networks (ANN), and Long Short-Term Memory (LSTM) networks were applied to the selected features for prediction. The work investigates two main research questions: (i) How does the sample size affect the prediction performance of ML algorithms? (ii) How does the data type affect the prediction performance of ML algorithms? Accuracy and area under the ROC curve (AUC) were used to compare model performance, and a t-test was performed to test the statistical significance of the prediction results.
    Date: 2024–04
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2404.19324&r=
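    A minimal sketch of the binarization-and-comparison workflow described above, assuming pandas and scikit-learn; the file and column names are hypothetical placeholders for the paper's actual indicator set:

      # Binarize continuous features into "trend" data (1 if a feature rose
      # versus the previous day, else 0) and compare two of the classifiers
      # on accuracy and AUC, keeping time order in the train/test split.
      import pandas as pd
      from sklearn.ensemble import RandomForestClassifier
      from sklearn.neighbors import KNeighborsClassifier
      from sklearn.metrics import accuracy_score, roc_auc_score

      df = pd.read_csv("btc_features.csv")                  # hypothetical inputs
      X_cont = df[["rsi", "google_trends", "tweet_count"]]  # continuous data
      X_trend = (X_cont.diff() > 0).astype(int).iloc[1:-1]  # trend data
      y = (df["close"].diff() > 0).astype(int).shift(-1).iloc[1:-1].astype(int)

      split = int(len(y) * 0.8)                             # chronological split
      for model in (RandomForestClassifier(), KNeighborsClassifier()):
          model.fit(X_trend[:split], y[:split])
          proba = model.predict_proba(X_trend[split:])[:, 1]
          print(type(model).__name__,
                accuracy_score(y[split:], proba > 0.5),
                roc_auc_score(y[split:], proba))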
  2. By: Pedro Duarte Gomes
    Abstract: This article introduces the groundbreaking concept of the financial differential machine learning algorithm through a rigorous mathematical framework. Diverging from existing literature on financial machine learning, the work highlights the profound implications of theoretical assumptions within financial models on the construction of machine learning algorithms. This endeavour is particularly timely as the finance landscape witnesses a surge in interest towards data-driven models for the valuation and hedging of derivative products. Notably, the predictive capabilities of neural networks have garnered substantial attention in both academic research and practical financial applications. The approach offers a unified theoretical foundation that facilitates comprehensive comparisons, both at a theoretical level and in experimental outcomes. Importantly, this theoretical grounding lends substantial weight to the experimental results, affirming the differential machine learning method's optimality within the prevailing context. By anchoring the insights in rigorous mathematics, the article bridges the gap between abstract financial concepts and practical algorithmic implementations.
    Date: 2024–05
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2405.01233&r=
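    For readers unfamiliar with the technique the article formalises: differential machine learning (in the sense of Huge and Savine) trains a network on simulated payoffs and their pathwise derivatives jointly. A minimal PyTorch sketch on synthetic one-step GBM call data, not the article's own implementation:

      # Fit a network to payoffs AND pathwise deltas: the loss penalises both
      # value errors and derivative errors (via autograd). Illustrative only.
      import torch

      net = torch.nn.Sequential(torch.nn.Linear(1, 64), torch.nn.Softplus(),
                                torch.nn.Linear(64, 1))
      opt = torch.optim.Adam(net.parameters(), lr=1e-3)

      S0 = 0.5 + torch.rand(512, 1)                             # initial spots
      ST = S0 * torch.exp(-0.02 + 0.2 * torch.randn(512, 1))    # one-step GBM
      payoff = torch.clamp(ST - 1.0, min=0.0)                   # call, strike 1
      delta = (ST > 1.0).float() * ST / S0                      # pathwise dP/dS0

      for _ in range(1000):
          S = S0.clone().requires_grad_(True)
          pred = net(S)
          dpred = torch.autograd.grad(pred.sum(), S, create_graph=True)[0]
          loss = ((pred - payoff) ** 2).mean() + ((dpred - delta) ** 2).mean()
          opt.zero_grad(); loss.backward(); opt.step()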
  3. By: Pierre Beck; Pablo Garcia-Sanchez; Alban Moura; Julien Pascal; Olivier Pierrard
    Abstract: This technical report provides an introduction to solving economic models using deep learning techniques. We offer a simple yet rigorous overview of deep learning methods and their applicability to economic modeling. We illustrate these concepts using the benchmark of modern macroeconomic theory: the stochastic growth model. Our results emphasize how various choices related to the design of the deep learning solution affect the accuracy of the results, providing some guidance for potential users of the method. We also provide fully commented computer codes. Overall, our hope is that this report will serve as an accessible, useful entry point to applying deep learning techniques to solve economic models for graduate students and researchers interested in the field.
    Keywords: Solutions of DSGE models, deep learning, artificial neural networks
    JEL: C45 C60 C63 E13
    Date: 2024–05
    URL: http://d.repec.org/n?u=RePEc:bcl:bclwop:bclwp184&r=
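    A toy version of the approach the report introduces, assuming PyTorch: approximate the consumption policy of a growth model with a neural network by minimising Euler-equation residuals. Log utility and full depreciation are chosen here because the exact policy c = (1 - a*b)*k**a is known, so the network's accuracy can be checked; the parameters are illustrative, not the report's:

      import torch

      a, b = 0.33, 0.96                              # capital share, discount
      net = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(),
                                torch.nn.Linear(32, 1), torch.nn.Softplus())
      opt = torch.optim.Adam(net.parameters(), lr=1e-3)

      for _ in range(2000):
          k = torch.rand(256, 1) * 0.5 + 0.05        # sample capital states
          y = k ** a                                 # output, full depreciation
          c = torch.minimum(net(k), y * 0.999)       # feasible consumption
          k_next = y - c
          c_next = torch.minimum(net(k_next), (k_next ** a) * 0.999)
          euler = 1.0 / c - b * a * k_next ** (a - 1.0) / c_next
          loss = torch.mean(euler ** 2)              # Euler residual, log utility
          opt.zero_grad(); loss.backward(); opt.step()

      print(float(net(torch.tensor([[0.2]]))), (1 - a * b) * 0.2 ** a)  # vs exact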
  4. By: Minwu Kim
    Abstract: This work develops a predictive model to identify potential targets of activist investment funds, which strategically acquire significant corporate stakes to drive operational and strategic improvements and enhance shareholder value. Predicting these targets is crucial for companies to mitigate intervention risks, for activists to select optimal targets, and for investors to capitalize on associated stock price gains. Our analysis utilizes data from the Russell 3000 index from 2016 to 2022. We tested 123 variations of models using different data imputation, oversampling, and machine learning methods, achieving a top AUC-ROC of 0.782. This demonstrates the model's effectiveness in identifying likely targets of activist funds. We applied the Shapley value method to determine the most influential factors in a company's susceptibility to activist investment. This interpretative approach provides clear insights into the driving forces behind activist targeting. Our model offers stakeholders a strategic tool for proactive corporate governance and investment strategy, enhancing understanding of the dynamics of activist investing.
    Date: 2024–04
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2404.16169&r=
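    A sketch of one of the 123 model variations the abstract describes: median imputation, SMOTE oversampling, gradient boosting scored by AUC-ROC, and Shapley values for interpretation. The dataset, label and feature names are hypothetical; requires scikit-learn, imbalanced-learn and shap:

      import pandas as pd, shap
      from imblearn.pipeline import Pipeline
      from imblearn.over_sampling import SMOTE
      from sklearn.impute import SimpleImputer
      from sklearn.ensemble import GradientBoostingClassifier
      from sklearn.model_selection import train_test_split
      from sklearn.metrics import roc_auc_score

      df = pd.read_csv("russell3000_panel.csv")      # firm-year features + label
      X, y = df.drop(columns="is_target"), df["is_target"]
      X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

      pipe = Pipeline([("impute", SimpleImputer(strategy="median")),
                       ("smote", SMOTE(random_state=0)),   # oversample rare class
                       ("gbm", GradientBoostingClassifier())])
      pipe.fit(X_tr, y_tr)
      print("AUC-ROC:", roc_auc_score(y_te, pipe.predict_proba(X_te)[:, 1]))

      # Shapley values for the fitted booster, on imputed test features
      sv = shap.TreeExplainer(pipe.named_steps["gbm"]).shap_values(
          pipe.named_steps["impute"].transform(X_te))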
  5. By: Silvia García-Méndez; Francisco de Arriba-Pérez; Ana Barros-Vila; Francisco J. González-Castaño
    Abstract: Microblogging platforms, of which Twitter is a representative example, are valuable information sources for market screening and financial models. On them, users voluntarily provide relevant information, including educated knowledge on investments, reacting to the state of the stock markets in real time and often influencing this state. We are interested in user forecasts in financial social-media messages expressing opportunities and precautions about assets. We propose a novel Targeted Aspect-Based Emotion Analysis (TABEA) system that can individually discern the financial emotions (positive and negative forecasts) attached to the different stock-market assets in the same tweet, instead of making an overall guess about the whole tweet. It is based on Natural Language Processing (NLP) techniques and streaming Machine Learning algorithms. The system comprises a constituency-parsing module that parses the tweets and splits them into simpler declarative clauses; an offline data-processing module that engineers textual, numerical and categorical features and analyses and selects them based on their relevance; and a stream-classification module that continuously processes tweets on the fly. Experimental results on a labelled data set endorse our solution: it achieves over 90% precision for the target emotions, financial opportunity and precaution, on Twitter. To the best of our knowledge, no prior work in the literature has addressed this problem despite its practical interest in decision-making, and we are not aware of any previous NLP or online Machine Learning approaches to TABEA.
    Date: 2024–03
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2404.08665&r=
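    A minimal sketch of the streaming-classification idea in this entry, assuming the river library: an online text classifier that learns one tweet clause at a time, so the model can process the stream on the fly. The labels and texts are toy examples, not the paper's pipeline:

      from river import feature_extraction, naive_bayes

      model = feature_extraction.BagOfWords(lowercase=True) | \
              naive_bayes.MultinomialNB()

      stream = [("$XYZ looks ready to break out, buying more", "opportunity"),
                ("careful with $ABC, debt load is getting scary", "precaution")]
      for text, label in stream:
          print(model.predict_one(text))   # predict before learning (prequential)
          model.learn_one(text, label)     # then update the model incrementally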
  6. By: Shuochen Bi; Wenqing Bao; Jue Xiao; Jiangshan Wang; Tingting Deng
    Abstract: With the continuous development of artificial intelligence technology, using machine learning to predict market trends is no longer out of reach. In recent years, artificial intelligence has become a research hotspot in academia and has been widely applied in image recognition, natural language processing and other fields; it has also had a large impact on quantitative investment. As an investment approach that seeks stable returns through data analysis, model construction and programmatic trading, quantitative investment is popular with financial institutions and investors, and quantitative investment strategies based on artificial intelligence technology have emerged as an important application of the field. How to apply artificial intelligence to quantitative investment, so as to better achieve returns and control risk, has therefore become a central and difficult research question. From a global perspective, US inflation and the Federal Reserve are key concerns of investors, and to some extent they shape the direction of global assets, including the Chinese stock market. This paper studies AI technology, quantitative investment, and the application of AI technology in quantitative investment, aiming to provide investors with decision support, reduce the difficulty of investment analysis, and help them obtain higher returns.
    Date: 2024–04
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2404.18184&r=
  7. By: Prabhu Prasad Panda; Maysam Khodayari Gharanchaei; Xilin Chen; Haoshu Lyu
    Abstract: The paper examines the performance of regression models (OLS linear regression, Ridge regression, Random Forest, and a fully-connected Neural Network) in predicting the CMA (Conservative Minus Aggressive) factor premium, and the performance of factor-timing investment based on them. Out-of-sample R-squared shows that more flexible models better explain the variance in the factor premium over the unseen period, and backtesting confirms that factor timing based on more flexible models tends to outperform timing based on linear models. However, for flexible models such as neural networks, the optimal weights based on their predictions tend to be unstable, which can lead to high transaction costs and market impact. We verify that reducing the rebalancing frequency according to the historical optimal rebalancing scheme can help reduce transaction costs.
    Date: 2024–04
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2404.18017&r=
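    A sketch of the core comparison above: fit OLS and a random forest on lagged predictors of the factor premium and compare out-of-sample R-squared against a historical-mean benchmark. The predictor names and data file are hypothetical:

      import numpy as np, pandas as pd
      from sklearn.linear_model import LinearRegression
      from sklearn.ensemble import RandomForestRegressor

      df = pd.read_csv("cma_premium.csv").dropna()   # hypothetical panel
      X = df[["valuation_spread", "momentum", "vol"]].values
      y = df["cma_ret"].values
      split = int(len(y) * 0.7)                      # chronological split

      def oos_r2(model):
          # OOS R^2 = 1 - SSE(model) / SSE(historical-mean forecast)
          model.fit(X[:split], y[:split])
          pred = model.predict(X[split:])
          bench = np.full_like(pred, y[:split].mean())
          return 1 - np.sum((y[split:] - pred) ** 2) / \
                     np.sum((y[split:] - bench) ** 2)

      print("OLS:", oos_r2(LinearRegression()))
      print("RF :", oos_r2(RandomForestRegressor(random_state=0)))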
  8. By: Yupeng Cao; Zhi Chen; Qingyun Pei; Prashant Kumar; K. P. Subbalakshmi; Papa Momar Ndiaye
    Abstract: In the realm of financial analytics, leveraging unstructured data, such as earnings conference calls (ECCs), to forecast stock performance is a critical challenge that has attracted both academics and investors. While previous studies have used deep learning-based models to obtain a general view of ECCs, they often fail to capture detailed, complex information. Our study introduces a novel framework, ECC Analyzer, combining Large Language Models (LLMs) and multi-modal techniques to extract richer, more predictive insights. The model begins by summarizing the transcript's structure and analyzing the speakers' mood and confidence level by detecting variations in tone and pitch in the audio. This analysis helps investors form an overall impression of the ECC. The model then uses Retrieval-Augmented Generation (RAG) methods to meticulously extract the focal points that significantly affect stock performance from an expert's perspective, providing a more targeted analysis. It goes a step further by enriching these extracted focal points with additional layers of analysis, such as sentiment and audio-segment features. By integrating these insights, the ECC Analyzer performs multi-task predictions of stock performance, including volatility, value-at-risk (VaR), and returns over different intervals. The results show that our model outperforms traditional analytic benchmarks, confirming the effectiveness of advanced LLM techniques in financial analytics.
    Date: 2024–04
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2404.18470&r=
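    A generic illustration of the retrieval step behind a RAG component like the one described above (not the authors' implementation): embed transcript chunks, index them, and pull the passages closest to an analyst-style query. The model name, chunks and query are illustrative; requires sentence-transformers and faiss:

      import faiss
      from sentence_transformers import SentenceTransformer

      chunks = ["Guidance for Q3 is raised to ...",
                "Margins compressed due to input costs ..."]
      enc = SentenceTransformer("all-MiniLM-L6-v2")
      emb = enc.encode(chunks, normalize_embeddings=True)

      index = faiss.IndexFlatIP(emb.shape[1])        # cosine via inner product
      index.add(emb)
      q = enc.encode(["What drives forward guidance?"], normalize_embeddings=True)
      scores, ids = index.search(q, 2)
      print([chunks[i] for i in ids[0]])             # passages passed to the LLM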
  9. By: Shuochen Bi; Wenqing Bao
    Abstract: With the rapid growth of technology, and especially the widespread application of artificial intelligence (AI), the risk management capabilities of commercial banks are constantly reaching new heights. In the current wave of digitalization, AI has become a key driving force for the strategic transformation of financial institutions, especially the banking industry. For commercial banks, the stability and safety of asset quality are crucial and directly relate to the bank's long-term stable growth. Credit risk management is particularly central because it involves the flow of large amounts of funds and the accuracy of credit decisions. Establishing a scientific and effective credit-risk decision-making mechanism is therefore of great strategic significance for commercial banks. In this context, the innovative application of AI technology has brought revolutionary changes to bank credit risk management. Through deep learning and big-data analysis, AI can accurately evaluate the credit status of borrowers, identify potential risks in a timely manner, and provide banks with more accurate and comprehensive credit decision support. At the same time, AI enables real-time monitoring and early warning, helping banks intervene before risks materialize and reduce losses.
    Date: 2024–04
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2404.18183&r=
  10. By: Patrick Rehill
    Abstract: This paper conducts a methodological review of papers using the causal forest machine learning method for flexibly estimating heterogeneous treatment effects. It examines 133 peer-reviewed papers. It shows that the emerging best practice relies heavily on the approach and tools created by the original authors of the causal forest such as their grf package and the approaches given by them in examples. Generally researchers use the causal forest on a relatively low-dimensional dataset relying on randomisation or observed controls to identify effects. There are several common ways to then communicate results -- by mapping out the univariate distribution of individual-level treatment effect estimates, displaying variable importance results for the forest and graphing the distribution of treatment effects across covariates that are important either for theoretical reasons or because they have high variable importance. Some deviations from this common practice are interesting and deserve further development and use. Others are unnecessary or even harmful.
    Date: 2024–04
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2404.13356&r=
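    The studies reviewed here overwhelmingly rely on the R package grf. As a rough Python analogue of the common workflow the review describes (estimate individual-level treatment effects, then map out their distribution), a sketch with econml on synthetic, randomised data:

      import numpy as np
      from econml.dml import CausalForestDML

      rng = np.random.default_rng(0)
      n = 2000
      X = rng.normal(size=(n, 5))                # observed covariates
      T = rng.binomial(1, 0.5, size=n)           # randomised binary treatment
      tau = 0.5 + X[:, 0]                        # true heterogeneous effect
      Y = tau * T + X[:, 1] + rng.normal(size=n)

      cf = CausalForestDML(discrete_treatment=True, random_state=0)
      cf.fit(Y, T, X=X)
      cate = cf.effect(X)                        # individual-level CATE estimates
      print(np.percentile(cate, [10, 50, 90]))   # univariate effect distribution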
  11. By: Yihang Fu; Mingyu Zhou; Luyao Zhang
    Abstract: In the distributed systems landscape, Blockchain has catalyzed the rise of cryptocurrencies, merging enhanced security and decentralization with significant investment opportunities. Despite their potential, current research on cryptocurrency trend forecasting often falls short by simplistically merging sentiment data without fully considering the nuanced interplay between financial market dynamics and external sentiment influences. This paper presents a novel Dual Attention Mechanism (DAM) for forecasting cryptocurrency trends using multimodal time-series data. Our approach, which integrates critical cryptocurrency metrics with sentiment data from news and social media analyzed through CryptoBERT, addresses the inherent volatility and prediction challenges in cryptocurrency markets. By combining elements of distributed systems, natural language processing, and financial forecasting, our method outperforms conventional models like LSTM and Transformer by up to 20% in prediction accuracy. This advancement deepens the understanding of distributed systems and has practical implications in financial markets, benefiting stakeholders in cryptocurrency and blockchain technologies. Moreover, our enhanced forecasting approach can significantly support decentralized science (DeSci) by facilitating strategic planning and the efficient adoption of blockchain technologies, improving operational efficiency and financial risk management in the rapidly evolving digital asset domain, thus ensuring optimal resource allocation.
    Date: 2024–05
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2405.00522&r=
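    A minimal PyTorch sketch of a dual-attention fusion in the spirit of DAM, not the paper's architecture: one attention branch over market-metric time series, one over sentiment embeddings (e.g. from CryptoBERT), concatenated for a trend classifier. Dimensions are illustrative:

      import torch, torch.nn as nn

      class DualAttention(nn.Module):
          def __init__(self, d=32):
              super().__init__()
              self.market_attn = nn.MultiheadAttention(d, 4, batch_first=True)
              self.sent_attn = nn.MultiheadAttention(d, 4, batch_first=True)
              self.head = nn.Linear(2 * d, 2)        # up / down trend logits

          def forward(self, market, sentiment):      # (batch, time, d) each
              m, _ = self.market_attn(market, market, market)
              s, _ = self.sent_attn(sentiment, sentiment, sentiment)
              fused = torch.cat([m[:, -1], s[:, -1]], dim=-1)  # last time step
              return self.head(fused)

      model = DualAttention()
      logits = model(torch.randn(8, 30, 32), torch.randn(8, 30, 32))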
  12. By: Kina, Mehmet Fuat
    Abstract: This study analyzes Turkey's political landscape by harnessing Computational Social Science techniques to parse extensive data on public ideologies from the Politus database. Unlike existing theoretical frameworks, which consider the ideologies of political elites and cadres, this study examines public ideologies in a contentious-politics setting. It distills the eight most prevalent ideologies down to the city level and employs unsupervised machine learning models. Principal Component Analysis delineates two fundamental axes: the traditional left-right political spectrum and a separate spectrum of secular-religious inclination, i.e., political and cultural dimensions. Cluster Analysis then reveals three distinct groups: left-leaning and religiously inclined, right-leaning and religiously inclined, and centrist with a pronounced secular focus. The outcomes provide valuable insights into the political and cultural axes within political society, offering a clearer understanding of the most recent ideological and political climate in Turkey.
    Date: 2024–05–03
    URL: http://d.repec.org/n?u=RePEc:osf:socarx:kp7s2&r=
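    A sketch of the two-step pipeline above, assuming scikit-learn; the eight ideology-prevalence columns and the file are hypothetical placeholders for the Politus variables:

      import pandas as pd
      from sklearn.preprocessing import StandardScaler
      from sklearn.decomposition import PCA
      from sklearn.cluster import KMeans

      df = pd.read_csv("politus_city_ideologies.csv", index_col="city")
      Z = StandardScaler().fit_transform(df)       # 8 ideology prevalence columns
      axes = PCA(n_components=2).fit_transform(Z)  # two latent axes
      labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(axes)
      print(pd.Series(labels, index=df.index).value_counts())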
  13. By: Huan-Yi Su; Ke Wu; Yu-Hao Huang; Wu-Jun Li
    Abstract: Recently, many works have proposed various financial large language models (FinLLMs) by pre-training from scratch or fine-tuning open-sourced LLMs on financial corpora. However, existing FinLLMs exhibit unsatisfactory performance in understanding financial text when numeric variables are involved in questions. In this paper, we propose a novel LLM, called the numeric-sensitive large language model (NumLLM), for Chinese finance. We first construct a financial corpus from financial textbooks, which is essential for improving the numeric capability of LLMs during fine-tuning. After that, we train two individual low-rank adaptation (LoRA) modules by fine-tuning on our constructed financial corpus. One module adapts general-purpose LLMs to the financial domain, and the other enhances the ability of NumLLM to understand financial text with numeric variables. Lastly, we merge the two LoRA modules into the foundation model to obtain NumLLM for inference. Experiments on a financial question-answering benchmark show that NumLLM boosts the performance of the foundation model and achieves the best overall performance compared to all baselines, on both numeric and non-numeric questions.
    Date: 2024–05
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2405.00566&r=
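    A sketch of the merge step described above, assuming the Hugging Face peft library: attach two LoRA adapters (domain adaptation and numeric sensitivity) to a foundation model and fold them back into the base weights for inference. The model and adapter paths are placeholders, not the authors' artifacts:

      from transformers import AutoModelForCausalLM
      from peft import PeftModel

      base = AutoModelForCausalLM.from_pretrained("base-financial-llm")
      model = PeftModel.from_pretrained(base, "lora-finance-domain",
                                        adapter_name="domain")
      model.load_adapter("lora-numeric", adapter_name="numeric")
      model.add_weighted_adapter(["domain", "numeric"], weights=[1.0, 1.0],
                                 adapter_name="merged",
                                 combination_type="linear")
      model.set_adapter("merged")
      model = model.merge_and_unload()   # plain model with merged LoRA weights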
  14. By: Jacob Fein-Ashley
    Abstract: We consider the Ornstein-Uhlenbeck (OU) process, a stochastic process widely used in finance, physics, and biology. Parameter estimation of the OU process is a challenging problem. Thus, we review traditional tracking methods and compare them with novel applications of deep learning to estimate the parameters of the OU process. We use a multi-layer perceptron to estimate the parameters of the OU process and compare its performance with traditional parameter estimation methods, such as the Kalman filter and maximum likelihood estimation. We find that the multi-layer perceptron can accurately estimate the parameters of the OU process given a large dataset of observed trajectories; however, traditional parameter estimation methods may be more suitable for smaller datasets.
    Date: 2024–04
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2404.11526&r=
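    A sketch of the classical baseline the paper compares against: exact maximum-likelihood estimation of OU parameters via the process's AR(1) discretisation X[t+1] = a*X[t] + b + eps, with a = exp(-theta*dt), b = mu*(1 - a) and Var(eps) = sigma^2*(1 - a^2)/(2*theta), so OLS on consecutive observations recovers the parameters:

      import numpy as np

      theta, mu, sigma, dt, n = 2.0, 1.0, 0.5, 0.01, 100_000
      rng = np.random.default_rng(0)
      a = np.exp(-theta * dt)
      sd = sigma * np.sqrt((1 - a**2) / (2 * theta))
      x = np.empty(n); x[0] = mu
      for t in range(n - 1):                         # simulate one trajectory
          x[t + 1] = a * x[t] + mu * (1 - a) + sd * rng.normal()

      a_hat, b_hat = np.polyfit(x[:-1], x[1:], 1)    # OLS = MLE for Gaussian AR(1)
      theta_hat = -np.log(a_hat) / dt
      mu_hat = b_hat / (1 - a_hat)
      resid = x[1:] - (a_hat * x[:-1] + b_hat)
      sigma_hat = np.sqrt(resid.var() * 2 * theta_hat / (1 - a_hat**2))
      print(theta_hat, mu_hat, sigma_hat)            # ~ (2.0, 1.0, 0.5)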
  15. By: Claudio Bellei; Muhua Xu; Ross Phillips; Tom Robinson; Mark Weber; Tim Kaler; Charles E. Leiserson; Arvind; Jie Chen
    Abstract: Subgraph representation learning is a technique for analyzing local structures (or shapes) within complex networks. Enabled by recent developments in scalable Graph Neural Networks (GNNs), this approach encodes relational information at a subgraph level (multiple connected nodes) rather than at a node level of abstraction. We posit that certain domain applications, such as anti-money laundering (AML), are inherently subgraph problems and mainstream graph techniques have been operating at a suboptimal level of abstraction. This is due in part to the scarcity of annotated datasets of real-world size and complexity, as well as the lack of software tools for managing subgraph GNN workflows at scale. To enable work in fundamental algorithms as well as domain applications in AML and beyond, we introduce Elliptic2, a large graph dataset containing 122K labeled subgraphs of Bitcoin clusters within a background graph consisting of 49M node clusters and 196M edge transactions. The dataset provides subgraphs known to be linked to illicit activity for learning the set of "shapes" that money laundering exhibits in cryptocurrency and accurately classifying new criminal activity. Along with the dataset we share our graph techniques, software tooling, promising early experimental results, and new domain insights already gleaned from this approach. Taken together, we find immediate practical value in this approach and the potential for a new standard in anti-money laundering and forensic analytics in cryptocurrencies and other financial networks.
    Date: 2024–04
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2404.19109&r=
  16. By: OECD
    Abstract: This paper details the methodology used to nowcast the growth rate of the information and communication technology (ICT) sector in the "The growth outlook of the ICT sector" chapter of the OECD Digital Economy Outlook 2024, Volume 1. In an era of rapid digital transformation, innovative data sources for economic measurement are crucial. Internet search data have gained prominence for tracking real-time economic activity. This paper details a nowcasting model that leverages Google Trends data to provide policymakers with timely, up-to-date and comparable data on the economic growth of the ICT sector. Having timely data on ICT sector performance is essential to evaluating the effectiveness of sector-related policies. By addressing data challenges and employing a data-driven approach, this paper advances economic measurement of the digitalisation of the economy and provides insights into ICT sector growth dynamics.
    Keywords: digital economy outlook, ICT sector, information and communication technology, nowcast, nowcasting
    Date: 2024–05–14
    URL: http://d.repec.org/n?u=RePEc:oec:stiaab:362-en&r=
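    The OECD's model is specified in the paper itself; as a generic illustration of the bridge-regression idea behind such nowcasts (regress published growth on search indices, then predict the not-yet-published quarter), a sketch with hypothetical files and an elastic-net learner:

      import pandas as pd
      from sklearn.linear_model import ElasticNetCV

      trends = pd.read_csv("trends_quarterly.csv", index_col="quarter")
      growth = pd.read_csv("ict_growth.csv", index_col="quarter")["growth"]

      hist = trends.index.intersection(growth.index)  # quarters already published
      model = ElasticNetCV(cv=5).fit(trends.loc[hist], growth.loc[hist])

      latest = trends.index.difference(growth.index)  # quarters to nowcast
      print(pd.Series(model.predict(trends.loc[latest]), index=latest))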
  17. By: Riccardo Di Francesco (DEF, University of Rome "Tor Vergata")
    Abstract: Empirical studies in various social sciences often involve categorical outcomes with inherent ordering, such as self-evaluations of subjective well-being and self-assessments in health domains. While ordered choice models, such as the ordered logit and ordered probit, are popular tools for analyzing these outcomes, they may impose restrictive parametric and distributional assumptions. This paper introduces a novel estimator, the ordered correlation forest, that can naturally handle non-linearities in the data and does not assume a specific error term distribution. The proposed estimator modifies a standard random forest splitting criterion to build a collection of forests, each estimating the conditional probability of a single class. Under an “honesty” condition, predictions are consistent and asymptotically normal. The weights induced by each forest are used to obtain standard errors for the predicted probabilities and the covariates’ marginal effects. Evidence from synthetic data shows that the proposed estimator delivers superior prediction performance relative to alternative forest-based estimators and constructs valid confidence intervals for the covariates’ marginal effects.
    Keywords: Ordered non-numeric outcomes, choice probabilities, machine learning
    JEL: C14 C25 C55
    Date: 2024–05–06
    URL: http://d.repec.org/n?u=RePEc:rtv:ceisrp:577&r=
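    A simplified analogue of the forest-per-class construction above, without the paper's modified splitting criterion or honesty-based inference: estimate the cumulative probabilities P(Y <= m | X) with one regression forest per class and difference them to obtain ordered choice probabilities, on synthetic data:

      import numpy as np
      from sklearn.ensemble import RandomForestRegressor

      rng = np.random.default_rng(0)
      X = rng.normal(size=(2000, 4))
      latent = X[:, 0] + 0.5 * X[:, 1] + rng.logistic(size=2000)
      y = np.digitize(latent, [-1.0, 1.0])           # ordered classes 0 < 1 < 2

      cum = []
      for m in range(2):                             # forests for P(y<=0), P(y<=1)
          f = RandomForestRegressor(random_state=0).fit(X, (y <= m).astype(float))
          cum.append(f.predict(X))
      cum = np.clip(np.column_stack(cum + [np.ones(len(y))]), 0, 1)
      # (a full implementation would also monotonise the cumulative estimates)
      probs = np.diff(cum, prepend=0.0, axis=1)      # per-class probabilities
      print(probs[:3].round(2), probs.sum(axis=1)[:3])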
  18. By: Kasy, Maximilian (University of Oxford)
    Abstract: Most definitions of algorithmic bias and fairness encode decisionmaker interests, such as profits, rather than the interests of disadvantaged groups (e.g., racial minorities): Bias is defined as a deviation from profit maximization. Future research should instead focus on the causal effect of automated decisions on the distribution of welfare, both across and within groups. The literature emphasizes some apparent contradictions between different notions of fairness, and between fairness and profits. These contradictions vanish, however, when profits are maximized. Existing work involves conceptual slippages between statistical notions of bias and misclassification errors, economic notions of profit, and normative notions of bias and fairness. Notions of bias nonetheless carry some interest within the welfare paradigm that I advocate for, if we understand bias and discrimination as mechanisms and potential points of intervention.
    Keywords: AI, algorithmic bias, inequality, machine learning, discrimination
    JEL: J7 O3
    Date: 2024–04
    URL: http://d.repec.org/n?u=RePEc:iza:izadps:dp16944&r=
  19. By: Gabriele Ciminelli; Antton Haramboure; Lea Samek; Cyrille Schwellnus; Allison Shrivastava; Tara Sinclair
    Abstract: Employment has recovered strongly from the COVID-19 pandemic despite large structural changes in labour markets, such as the widespread adoption of digital business models and remote work. We analyse whether the pandemic has been associated with labour reallocation across occupations and triggered mismatches between occupational labour demand and supply using novel data on employers’ job postings and jobseekers’ clicks across 19 countries from the online job site Indeed. Findings indicate that, on average across countries, the pandemic triggered large and persistent reallocation of postings and clicks across occupations. Occupational mismatch initially increased but was back to pre-pandemic levels at the end of 2022 as employers and workers adjusted to structural changes. The adjustment was substantially slower in countries that resorted to short-time work schemes to preserve employment during the pandemic.
    Keywords: covid-19 pandemic, occupational mismatch, reallocation
    JEL: E24 J23 J24 G18
    Date: 2024–05–17
    URL: http://d.repec.org/n?u=RePEc:oec:ecoaac:35-en&r=
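    A standard dissimilarity-style mismatch index consistent with the analysis above, though not necessarily the authors' exact measure: half the sum of absolute differences between each occupation's share of employer postings and its share of jobseeker clicks (0 = perfectly aligned, 1 = fully mismatched). The data layout is hypothetical:

      import pandas as pd

      # columns: country, month, occupation, postings, clicks
      df = pd.read_csv("indeed_occupations.csv")

      def mismatch(g):
          p = g["postings"] / g["postings"].sum()    # labour-demand shares
          c = g["clicks"] / g["clicks"].sum()        # labour-supply shares
          return 0.5 * (p - c).abs().sum()

      print(df.groupby(["country", "month"])[["postings", "clicks"]]
              .apply(mismatch))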

This nep-big issue is ©2024 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at https://nep.repec.org. For comments please write to the director of NEP, Marco Novarese at <director@nep.repec.org>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.