on Big Data |
By: | Achim Ahrens (ETH Zürich); Christian B. Hansen (University of Chicago); Mark E. Schaffer (Heriot-Watt University); Thomas Wiemann (University of Chicago) |
Abstract: | pystacked implements stacked generalization (Wolpert 1992) for regression and binary classification via Python’s scikit-learn. Stacking is an ensemble method that combines multiple supervised machine learners — the "base" or "level-0" learners — into a single learner. The currently-supported base learners include regularized regression (lasso, ridge, elastic net), random forest, gradient boosted trees, support vector machines, and feed-forward neural nets (multilayer perceptron). pystacked can also be used to fit a single base learner and thus provides an easy-to-use API for scikit-learn’s machine learning algorithms. ddml implements algorithms for causal inference aided by supervised machine learning as proposed in "Double/debiased machine learning for treatment and structural parameters" (Econometrics Journal 2018). Five different models are supported, allowing for binary or continuous treatment variables and endogeneity in the presence of high-dimensional controls and/or instrumental variables. ddml is compatible with many existing supervised machine learning programs in Stata, and in particular has integrated support for pystacked, making it straightforward to use machine learner ensemble methods in causal inference applications. |
Date: | 2023–09–10 |
URL: | http://d.repec.org/n?u=RePEc:boc:lsug23:12&r=big |
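The stacking idea described in the abstract above can be illustrated directly with scikit-learn, which pystacked calls under the hood. The snippet below is a minimal Python sketch on synthetic data, not pystacked's Stata syntax or its default settings: several level-0 learners are combined by a level-1 learner fitted on their cross-validated predictions.

```python
# Minimal scikit-learn sketch of stacked generalization (Wolpert 1992),
# the ensemble idea that pystacked exposes to Stata. Illustrative only;
# it does not reproduce pystacked's own defaults or syntax.
from sklearn.datasets import make_regression
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              StackingRegressor)
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=20, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Level-0 ("base") learners
base_learners = [
    ("lasso", LassoCV()),
    ("rf", RandomForestRegressor(n_estimators=200, random_state=0)),
    ("gbt", GradientBoostingRegressor(random_state=0)),
]

# Level-1 learner combines the cross-validated base predictions
stack = StackingRegressor(estimators=base_learners, final_estimator=RidgeCV(), cv=5)
stack.fit(X_train, y_train)
print("held-out R^2:", stack.score(X_test, y_test))
```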
By: | Paul Bilokon; Oleksandr Bilokon; Saeed Amen |
Abstract: | Recent advances in data science, machine learning, and artificial intelligence, such as the emergence of large language models, are leading to an increasing demand for data that can be processed by such models. While data sources are application-specific, and it is impossible to produce an exhaustive list of such data sources, it seems that a comprehensive, rather than complete, list would still benefit data scientists and machine learning experts of all levels of seniority. The goal of this publication is to provide just such an (inevitably incomplete) list -- or compendium -- of data sources across multiple areas of applications, including finance and economics, legal (laws and regulations), life sciences (medicine and drug discovery), news sentiment and social media, retail and ecommerce, satellite imagery, shipping and logistics, and sports. |
Date: | 2023–09 |
URL: | http://d.repec.org/n?u=RePEc:arx:papers:2309.05682&r=big |
By: | Douglas Kiarelly Godoy de Araujo |
Abstract: | gingado is an open source Python library that offers a variety of convenience functions and objects to support usage of machine learning in economics research. It is designed to be compatible with widely used machine learning libraries. gingado facilitates augmenting user datasets with relevant data directly obtained from official sources by leveraging the SDMX data and metadata sharing protocol. The library also offers a benchmarking object that creates a random forest with a reasonably good performance out-of-the-box and, if provided with candidate models, retains the one with the best performance. gingado also includes methods to help with machine learning model documentation, including ethical considerations. Further, gingado provides a flexible simulation of panel datasets with a variety of non-linear causal treatment effects, to support causal model prototyping and benchmarking. The library is under active development and new functionalities are periodically added or improved. |
Keywords: | machine learning, open source, data access, documentation |
JEL: | C87 C14 C82 |
Date: | 2023–09 |
URL: | http://d.repec.org/n?u=RePEc:bis:biswps:1122&r=big |
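The benchmarking workflow described above (start from a random forest with sensible defaults and keep whichever candidate model performs best) can be sketched with plain scikit-learn. The function below illustrates the concept only; it does not use gingado's actual classes or method names.

```python
# Conceptual sketch of a "benchmark" object: fit a default random forest,
# then swap it out only if a user-supplied candidate model scores better.
# This illustrates the workflow described for gingado; it is NOT gingado's API.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=400, n_features=15, noise=10.0, random_state=1)

def benchmark(X, y, candidates=()):
    """Return the best of a default random forest and any candidate models."""
    best_model = RandomForestRegressor(random_state=0)
    best_score = cross_val_score(best_model, X, y, cv=5).mean()
    for cand in candidates:
        score = cross_val_score(cand, X, y, cv=5).mean()
        if score > best_score:
            best_model, best_score = cand, score
    return best_model.fit(X, y), best_score

model, score = benchmark(X, y, candidates=[ElasticNetCV()])
print(type(model).__name__, round(score, 3))
```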
By: | Caravaggio, Nicola; Resce, Giuliano |
Abstract: | Accurate forecasting of healthcare costs is essential for making decisions, shaping policies, preparing finances, and managing resources effectively, but traditional econometric models fall short in addressing this policy challenge adequately. This paper introduces machine learning to predict healthcare expenditure in systems with heterogeneous regional needs. The Italian NHS is used as a case study, with administrative data spanning the years 1994 to 2019. The empirical analysis utilises four machine learning algorithms (Elastic-Net, Gradient Boosting, Random Forest, and Support Vector Regression) and a multivariate regression as a baseline. Gradient Boosting emerges as the superior algorithm in out-of-sample prediction performance; even when applied to 2019 data, the models trained up to 2018 demonstrate robust forecasting abilities. Important predictors of expenditure include temporal factors, average family size, regional area, GDP per capita, and life expectancy. The remarkable effectiveness of the model demonstrates that machine learning can be efficiently employed to distribute national healthcare funds to areas with heterogeneous needs. |
Keywords: | Machine Learning, National Health System, Healthcare expenditure |
JEL: | C54 H51 I10 |
Date: | 2023–10–03 |
URL: | http://d.repec.org/n?u=RePEc:mol:ecsdps:esdp23090&r=big |
By: | Mario Mitsuo Akita; Everton Josue da Silva |
Abstract: | Training machine learning models to predict stock prices has been an active area of research since automated trading of such securities became available in real time. While most work in this field trains neural networks on past prices of stock shares, in this work we use the iFeel 2.0 platform to extract 19 sentiment features from posts on the microblog platform Twitter that mention the company Petrobras. We then use those features to train XGBoost models to predict future stock prices for the company. Finally, we simulate the trading of Petrobras shares based on the model's outputs and determine a net gain of R$88.82 over a 250-day period compared with the average performance of 100 random models. |
Date: | 2023–09 |
URL: | http://d.repec.org/n?u=RePEc:arx:papers:2309.06538&r=big |
By: | Johan Brannlund; Helen Lao; Maureen MacIsaac; Jing Yang |
Abstract: | This paper examines whether machine learning (ML) algorithms can outperform a linear model in predicting monthly growth in Canada of both house prices and existing home sales. The aim is to apply two widely used ML techniques (support vector regression and multilayer perceptron) in economic forecasting to understand their scopes and limitations. We find that the two ML algorithms can perform better than a linear model in forecasting house prices and resales. However, the improvement in forecast accuracy is not always statistically significant. Therefore, we cannot systematically conclude using traditional time-series data that the ML models outperform the linear model in a significant way. Future research should explore non-traditional data sets to fully take advantage of ML methods. |
Keywords: | Econometric and statistical methods; Financial markets; Housing |
JEL: | A C45 C53 R2 R3 D2 |
Date: | 2023–09 |
URL: | http://d.repec.org/n?u=RePEc:bca:bocadp:23-21&r=big |
By: | Kumar, Rishabh (Bank of England); Koshiyama, Adriano (University College London); da Costa, Kleyton (University College London); Kingsman, Nigel (University College London); Tewarrie, Marvin (Bank of England); Kazim, Emre (University College London); Roy, Arunita (Reserve Bank of Australia); Treleaven, Philip (University College London); Lovell, Zac (Bank of England) |
Abstract: | Deep learning models are being utilised increasingly within finance. Given that the models are opaque in nature and are now being deployed for internal and consumer facing decisions, there are increasing concerns around the trustworthiness of their results. We test the stability of predictions and explanations of different deep learning models, which differ from one another only via subtle changes to model settings, with each model trained over the same data. Our results show that the models produce similar predictions but different explanations, even when the differences in model architecture are due to arbitrary factors like random seeds. We compare this behaviour with traditional, interpretable, ‘glass-box’ models, which show similar accuracies while maintaining stable explanations and predictions. Finally, we show a methodology based on network analysis to compare deep learning models. Our analysis has implications for the adoption and risk management of future deep learning models by regulated institutions. |
Keywords: | Deep neural networks; fragility; robustness; explainability; regulation |
JEL: | C45 C52 G18 |
Date: | 2023–09–01 |
URL: | http://d.repec.org/n?u=RePEc:boe:boeewp:1038&r=big |
By: | Gianluca Fabiani; Nikolaos Evangelou; Tianqi Cui; Juan M. Bello-Rivas; Cristina P. Martin-Linares; Constantinos Siettos; Ioannis G. Kevrekidis |
Abstract: | We present a machine learning (ML)-assisted framework bridging manifold learning, neural networks, Gaussian processes, and Equation-Free multiscale modeling, for (a) detecting tipping points in the emergent behavior of complex systems, and (b) characterizing probabilities of rare events (here, catastrophic shifts) near them. Our illustrative example is an event-driven, stochastic agent-based model (ABM) describing the mimetic behavior of traders in a simple financial market. Given high-dimensional spatiotemporal data -- generated by the stochastic ABM -- we construct reduced-order models for the emergent dynamics at different scales: (a) mesoscopic Integro-Partial Differential Equations (IPDEs); and (b) mean-field-type Stochastic Differential Equations (SDEs) embedded in a low-dimensional latent space, targeted to the neighborhood of the tipping point. We contrast the uses of the different models and the effort involved in learning them. |
Date: | 2023–09 |
URL: | http://d.repec.org/n?u=RePEc:arx:papers:2309.14334&r=big |
By: | Udit Gupta |
Abstract: | Annual Reports of publicly listed companies contain vital information about their financial health, which can help assess the potential impact on the firm's stock price. These reports are comprehensive in nature, running up to, and sometimes exceeding, 100 pages. Analysing these reports is cumbersome even for a single firm, let alone the whole universe of firms that exist. Over the years, financial experts have become proficient in extracting valuable information from these documents relatively quickly; however, this requires years of practice and experience. This paper aims to simplify the process of assessing Annual Reports of all firms by leveraging the capabilities of Large Language Models (LLMs). The insights generated by the LLM are compiled into a quant-style dataset and augmented with historical stock price data. A machine learning model is then trained with the LLM outputs as features. Walk-forward test results show promising outperformance relative to S&P 500 returns. This paper intends to provide a framework for future work in this direction. To facilitate this, the code has been released as open source. |
Date: | 2023–09 |
URL: | http://d.repec.org/n?u=RePEc:arx:papers:2309.03079&r=big |
By: | Zhou, Yunzhe; Qi, Zhengling; Shi, Chengchun; Li, Lexin |
Abstract: | In this article, we propose a novel pessimism-based Bayesian learning method for optimal dynamic treatment regimes in the offline setting. When the coverage condition does not hold, which is common for offline data, the existing solutions would produce sub-optimal policies. The pessimism principle addresses this issue by discouraging recommendation of actions that are less explored conditioning on the state. However, nearly all pessimism-based methods rely on a key hyper-parameter that quantifies the degree of pessimism, and the performance of the methods can be highly sensitive to the choice of this parameter. We propose to integrate the pessimism principle with Thompson sampling and Bayesian machine learning for optimizing the degree of pessimism. We derive a credible set whose boundary uniformly lower bounds the optimal Q-function, and thus we do not require additional tuning of the degree of pessimism. We develop a general Bayesian learning method that works with a range of models, from Bayesian linear basis model to Bayesian neural network model. We develop the computational algorithm based on variational inference, which is highly efficient and scalable. We establish the theoretical guarantees of the proposed method, and show empirically that it outperforms the existing state-of-the-art solutions through both simulations and a real data example. |
JEL: | C1 |
Date: | 2023–01–20 |
URL: | http://d.repec.org/n?u=RePEc:ehl:lserod:118233&r=big |
By: | Foozhan Ataiefard; Hadi Hemmati |
Abstract: | In recent years, deep reinforcement learning (Deep RL) has been successfully implemented as a smart agent in many systems such as complex games, self-driving cars, and chat-bots. One of the interesting use cases of Deep RL is its application as an automated stock trading agent. In general, any automated trading agent is prone to manipulation by adversaries in the trading environment, so studying its robustness is vital for success in practice. However, the typical mechanism for studying RL robustness, which is based on white-box gradient-based adversarial sample generation techniques (such as FGSM), is not applicable to this use case, since the models are protected behind secure international exchange APIs, such as NASDAQ. In this research, we demonstrate that a "gray-box" approach for attacking a Deep RL-based trading agent is possible by trading in the same stock market, with no extra access to the trading agent. In our proposed approach, an adversary agent uses a hybrid deep neural network as its policy, consisting of convolutional layers and fully connected layers. On average, over three simulated trading market configurations, the adversary policy proposed in this research is able to reduce the reward values by 214.17%, which results in reducing the potential profits of the baseline by 139.4%, the ensemble method by 93.7%, and an automated trading software developed by our industrial partner by 85.5%, while consuming significantly less budget than the victims (427.77%, 187.16%, and 66.97%, respectively). |
Date: | 2023–09 |
URL: | http://d.repec.org/n?u=RePEc:arx:papers:2309.14615&r=big |
By: | Pietro Sancassani |
Abstract: | Does the salience of a topic affect polarization in related parliamentary debates? When discussing a salient topic, politicians might adopt more extreme stances to gain electoral consensus. Alternatively, they could converge towards more moderate positions to find a compromise. Using parliamentary debates from the 16 German state parliaments, I exploit the exogenous increase in the salience of education induced by the unexpectedly low performance of German students in the PISA 2000 test—the German “PISA shock”. I combine machine-learning and text analysis techniques to obtain topic-specific measures of polarization of parliamentary debates. In a difference-in-differences framework, I find that the PISA shock caused an 8.8% of a standard deviation increase in polarization of education debates compared to other topics. The effect is long-lasting and fades after about six years. |
Keywords: | Polarization, text analysis, machine learning, Germany, PISA shock |
JEL: | D72 D71 |
Date: | 2023 |
URL: | http://d.repec.org/n?u=RePEc:ces:ifowps:_401&r=big |
By: | Giovanni Cerulli (CNR-IRCRES, National Research Council of Italy, Research Institute on Sustainable Economic Growth) |
Abstract: | This paper provides a comprehensive survey reviewing machine learning (ML) commands in Stata. I systematically categorize and summarize the available ML commands in Stata and evaluate their performance and usability for different tasks such as classification, regression, clustering, and dimension reduction. I also provide examples of how to use these commands with real-world datasets and compare their performance. This review aims to help researchers and practitioners choose appropriate ML methods and related Stata tools for their specific research questions and datasets, and to improve the efficiency and reproducibility of ML analyses using Stata. I conclude by discussing some limitations and future directions for ML research in Stata. |
Date: | 2023–09–10 |
URL: | http://d.repec.org/n?u=RePEc:boc:lsug23:08&r=big |
By: | Joachim Wagner (Leuphana Universität Lüneburg, Institut für Volkswirtschaftslehre) |
Abstract: | The use of big data analytics (including data mining and predictive analytics) by firms can be expected to increase productivity and reduce trade costs, which should be positively related to export activities. This paper uses firm-level data from the Flash Eurobarometer 486 survey conducted in February – May 2020 to investigate the link between the use of big data analytics and export activities in manufacturing enterprises from the 27 member countries of the European Union. We find that firms which use big data analytics are more likely to export, more likely to export to various destinations all over the world, and export to a larger number of different destinations. The estimated big data analytics premia for exports are statistically highly significant after controlling for firm size, firm age, patents, and country. Furthermore, the size of these premia can be considered large. Successful exporters tend to use big data analytics. |
Keywords: | Big data analytics, exports, firm level data, Flash Eurobarometer 486 |
JEL: | D22 F14 |
Date: | 2023–09 |
URL: | http://d.repec.org/n?u=RePEc:lue:wpaper:421&r=big |
By: | Tae-Hwy Lee (Department of Economics, University of California Riverside); Ekaterina Seregina (Colby College) |
Abstract: | In this paper we develop a novel method of combining many forecasts based on a machine learning algorithm called Graphical LASSO (GL). We visualize forecast errors from different forecasters as a network of interacting entities and generalize network inference in the presence of common factor structure and structural breaks. First, we note that forecasters often use common information and hence make common mistakes,  which makes the forecast errors exhibit common factor structures. We use the Factor Graphical LASSO (FGL, Lee and Seregina (2023)) to separate common forecast errors from the idiosyncratic errors and exploit sparsity of the precision matrix of the latter. Second, since the network of experts changes over time as a response to unstable environments such as recessions, it is unreasonable to assume constant forecast combination weights. Hence, we propose Regime-Dependent Factor Graphical LASSO (RD-FGL) that allows factor loadings and idiosyncratic precision matrix to be regime-dependent. We develop its scalable implementation using the Alternating Direction Method of Multipliers (ADMM) to estimate regime-dependent forecast combination weights. The empirical application to forecasting macroeconomic series using the data of the European Central Bank’s Survey of Professional Forecasters (ECB SPF) demonstrates superior performance of a combined forecast using FGL and RD-FGL. |
Keywords: | Common Forecast Errors, Regime Dependent Forecast Combination, Sparse Precision Matrix of Idiosyncratic Errors, Structural Breaks. |
JEL: | C13 C38 C55 |
Date: | 2023–09 |
URL: | http://d.repec.org/n?u=RePEc:ucr:wpaper:202310&r=big |
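The basic building block of precision-matrix-based forecast combination can be illustrated with scikit-learn's GraphicalLasso. The sketch below estimates a sparse precision matrix P of simulated forecast errors and forms the standard combination weights w = P·1 / (1'·P·1); it deliberately omits the factor structure and regime dependence that distinguish FGL and RD-FGL.

```python
# Illustrative sketch: estimate a sparse precision matrix of forecast errors
# with the Graphical LASSO and form combination weights w = P·1 / (1'·P·1).
# This shows only the basic building block, not the FGL/RD-FGL estimators.
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
n_periods, n_forecasters = 200, 8
# Simulated forecast errors with a common component plus idiosyncratic noise
common = rng.normal(size=(n_periods, 1))
errors = 0.7 * common + rng.normal(scale=0.5, size=(n_periods, n_forecasters))

gl = GraphicalLasso(alpha=0.05).fit(errors)
P = gl.precision_                       # sparse inverse covariance estimate
ones = np.ones(n_forecasters)
weights = P @ ones / (ones @ P @ ones)  # precision-weighted combination weights
print(np.round(weights, 3), "sum =", round(weights.sum(), 3))
```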
By: | Gadat, Sébastien; Villeneuve, Stéphane |
Abstract: | This document introduces a parsimonious novel method of processing textual data based on NMF factorization and on supervised clustering with Wasserstein barycenters to reduce the dimension of the model. This dual treatment of textual data allows for a representation of a text as a probability distribution on the space of profiles, which accounts for both uncertainty and semantic interpretability with the Wasserstein distance. The full textual information of a given period is represented as a random probability measure. This opens the door to a statistical inference method that seeks to predict financial data using the information generated by the texts of a given period. |
Keywords: | Natural Language Processing; Textual Analysis; Wasserstein distance; clustering |
Date: | 2023–09–20 |
URL: | http://d.repec.org/n?u=RePEc:tse:wpaper:128497&r=big |
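The first stage of the approach, representing each text as a probability distribution over latent profiles via NMF, can be sketched with scikit-learn. The snippet below is illustrative only and does not cover the supervised clustering or Wasserstein-barycenter reduction described in the abstract.

```python
# Illustrative sketch: TF-IDF + NMF, then normalize each document's topic
# loadings into a probability distribution over latent "profiles".
# This covers only the first stage of the method described above.
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "central bank raises interest rates amid inflation concerns",
    "equity markets rally as earnings beat expectations",
    "oil prices fall on weaker global demand outlook",
    "regulator tightens capital requirements for large banks",
]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)
nmf = NMF(n_components=2, init="nndsvda", random_state=0)
W = nmf.fit_transform(tfidf)  # document-by-profile loadings (non-negative)

# Each row becomes a probability distribution over profiles
probs = W / np.maximum(W.sum(axis=1, keepdims=True), 1e-12)
print(np.round(probs, 3))
```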
By: | Cappelletti, Matilde; Giuffrida, Leonardo M.; Heaton, Sohvi; Siegel, Donald S. |
Abstract: | A key issue in strategic management in the public sector is how government creates economic and social value through procurement. Unfortunately, most procurement studies are based on contract theories, which fail to incorporate the growing role of strategic management in performance. We fill this gap by analyzing longitudinal data on contracting to assess the equity and efficiency effects of a form of affirmative action used by governments: set-aside programs. Employing a machine learning-augmented propensity score weighting approach, we find that set-aside contracts are negatively associated with contract performance. These effects are attenuated by an agency's dynamic capabilities and the extent to which the agency uses more competitive procedures. Our findings illustrate how the dynamic capabilities of a federal agency can simultaneously enhance equity and efficiency. |
Keywords: | Dynamic capabilities, resource-based view, public procurement, machine learning, random forest |
JEL: | D73 H57 O38 L22 |
Date: | 2023 |
URL: | http://d.repec.org/n?u=RePEc:zbw:zewdip:23035&r=big |
By: | Fantazzini, Dean; Xiao, Yufeng |
Abstract: | Detecting pump-and-dump schemes involving cryptoassets with high-frequency data is challenging due to imbalanced datasets and the early occurrence of unusual trading volumes. To address these issues, we propose constructing synthetic balanced datasets using resampling methods and flagging a pump-and-dump from the moment of public announcement up to 60 min beforehand. We validated our proposals using data from Pumpolymp and the CryptoCurrency eXchange Trading Library to identify 351 pump signals relative to the Binance crypto exchange in 2021 and 2022. We found that the most effective approach was using the original imbalanced dataset with pump-and-dumps flagged 60 min in advance, together with a random forest model with data segmented into 30-s chunks and regressors computed with a moving window of 1 h. Our analysis revealed that a better balance between sensitivity and specificity could be achieved by simply selecting an appropriate probability threshold, such as setting the threshold close to the observed prevalence in the original dataset. Resampling methods were useful in some cases, but threshold-independent measures were not affected. Moreover, detecting pump-and-dumps in real-time involves high-dimensional data, and the use of resampling methods to build synthetic datasets can be time-consuming, making them less practical. |
Keywords: | pump-and-dump; crypto-assets; minority class; class imbalance; machine learning; random forests |
JEL: | C14 C25 C35 C38 C51 C53 C58 G17 G32 K42 |
Date: | 2023 |
URL: | http://d.repec.org/n?u=RePEc:pra:mprapa:118435&r=big |
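The thresholding point made in the abstract, namely that sensitivity and specificity can be rebalanced on an imbalanced dataset by classifying at a probability cut-off near the observed prevalence rather than 0.5, can be sketched as follows. The data are simulated; the paper's own features come from 30-second chunks with regressors computed over a 1-hour moving window.

```python
# Illustrative sketch: on imbalanced data, move the decision threshold of a
# random forest from 0.5 to (roughly) the observed prevalence of the positive
# class, instead of resampling. Simulated data, not the paper's Binance dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20000, n_features=10, weights=[0.99, 0.01],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

prevalence = y_tr.mean()  # share of positives in the training data
for thr in (0.5, prevalence):
    pred = (proba >= thr).astype(int)
    sens = recall_score(y_te, pred)               # sensitivity
    spec = recall_score(y_te, pred, pos_label=0)  # specificity
    print(f"threshold={thr:.3f}  sensitivity={sens:.3f}  specificity={spec:.3f}")
```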
By: | Peer Nagy; Sascha Frey; Silvia Sapora; Kang Li; Anisoara Calinescu; Stefan Zohren; Jakob Foerster |
Abstract: | Developing a generative model of realistic order flow in financial markets is a challenging open problem, with numerous applications for market participants. Addressing this, we propose the first end-to-end autoregressive generative model that generates tokenized limit order book (LOB) messages. These messages are interpreted by a Jax-LOB simulator, which updates the LOB state. To handle long sequences efficiently, the model employs simplified structured state-space layers to process sequences of order book states and tokenized messages. Using LOBSTER data of NASDAQ equity LOBs, we develop a custom tokenizer for message data, converting groups of successive digits to tokens, similar to tokenization in large language models. Out-of-sample results show promising performance in approximating the data distribution, as evidenced by low model perplexity. Furthermore, the mid-price returns calculated from the generated order flow exhibit a significant correlation with the data, indicating impressive conditional forecast performance. Due to the granularity of generated data, and the accuracy of the model, it offers new application areas for future work beyond forecasting, e.g. acting as a world model in high-frequency financial reinforcement learning applications. Overall, our results invite the use and extension of the model in the direction of autoregressive large financial models for the generation of high-frequency financial data and we commit to open-sourcing our code to facilitate future research. |
Date: | 2023–08 |
URL: | http://d.repec.org/n?u=RePEc:arx:papers:2309.00638&r=big |
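A hypothetical sketch of the digit-group tokenization mentioned in the abstract follows; the paper's actual tokenizer, field layout, and vocabulary are not reproduced here. Numeric LOB message fields are zero-padded and split into fixed-width digit groups, each mapped to a token id.

```python
# Hypothetical illustration of tokenizing numeric LOB message fields by
# splitting them into fixed-width digit groups (here: 2 digits per token).
# The paper's actual tokenizer, field layout, and vocabulary will differ.
def tokenize_number(value: int, width: int = 8, group: int = 2) -> list[int]:
    """Zero-pad `value` to `width` digits and map each `group`-digit chunk
    to a token id (offset so ids do not collide with special tokens)."""
    digits = f"{value:0{width}d}"
    assert len(digits) == width, "value exceeds the fixed field width"
    return [100 + int(digits[i:i + group]) for i in range(0, width, group)]

# Example: an order message (price in ticks, size in shares)
price_ticks, size = 1234567, 250
tokens = tokenize_number(price_ticks) + tokenize_number(size)
print(tokens)
```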
By: | Mihnea Constantinescu (National Bank of Ukraine and University of Amsterdam) |
Abstract: | Forecasting economic activity during an invasion is a nontrivial exercise. The lack of timely statistical data and the expected nonlinear effect of military action challenge the use of established nowcasting and short-term forecasting methodologies. In a recent study (Constantinescu (2023b)), I explore the use of Partial Least Squares (PLS) augmented with an additional variable selection step to nowcast quarterly Ukrainian GDP using Google search data. Model outputs are benchmarked against both static and Dynamic Factor Models. Preliminary results outline the usefulness of PLS in capturing the effects of large shocks in a setting rich in data, but poor in statistics. |
Keywords: | Nowcasting; quarterly GDP; Google Trends; Machine Learning; Partial Least Squares; Sparsity; Markov Blanket |
JEL: | C38 C53 E32 E37 |
Date: | 2023–09–14 |
URL: | http://d.repec.org/n?u=RePEc:gii:giihei:heidwp15-2023&r=big |
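The core PLS step, compressing many search-based indicators into a few components used to nowcast GDP growth, can be sketched with scikit-learn's PLSRegression. This is a minimal illustration on simulated data; it omits the additional variable-selection (Markov blanket) step and the benchmarking against factor models mentioned in the abstract.

```python
# Minimal sketch of Partial Least Squares nowcasting: many noisy indicators
# (e.g., Google search categories) compressed into a few PLS components to
# predict a target (e.g., quarterly GDP growth). Simulated data only.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(42)
n_quarters, n_indicators = 80, 60
latent = rng.normal(size=(n_quarters, 2))  # unobserved activity factors
X = latent @ rng.normal(size=(2, n_indicators)) \
    + rng.normal(scale=1.0, size=(n_quarters, n_indicators))
y = latent @ np.array([1.5, -0.8]) + rng.normal(scale=0.3, size=n_quarters)

pls = PLSRegression(n_components=2)
for train_idx, test_idx in TimeSeriesSplit(n_splits=4).split(X):
    pls.fit(X[train_idx], y[train_idx])
    print("out-of-sample R^2:", round(pls.score(X[test_idx], y[test_idx]), 3))
```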
By: | Zunaidah Sulong (Universiti Sultan Zainal Abidin, Malaysia); Mohammad Abdullah (Universiti Sultan Zainal Abidin, Malaysia); Emmanuel J. A. Abakah (University of Ghana Business School, Accra Ghana); David Adeabah (University of Ghana Business School, Accra Ghana); Simplice Asongu (Yaoundé, Cameroon) |
Abstract: | War-related expectations cause changes to investors’ risk and return preferences. In this study, we examine the implications of war and sanctions sentiment for the G7 countries’ debt markets during the Russia-Ukraine war. We use behavioral indicators across social media, news media, and internet attention to reflect public sentiment from 1st January 2022 to 20th April 2023. We apply the quantile-on-quantile regression (QQR) and rolling window wavelet correlation (RWWC) methods. The quantile-on-quantile regression results show a heterogeneous impact on fixed income securities: extreme public sentiment has a negative impact on G7 fixed income securities returns. The wavelet correlation results show a dynamic correlation pattern between public sentiment and fixed income securities: there is a negative relationship between public sentiment and G7 fixed income securities, and the correlation is time-varying and highly event-dependent. Our additional analysis using corporate bond data indicates the robustness of our findings. Furthermore, the contagion analysis shows that public sentiment significantly influences G7 fixed income securities spillovers. Our findings can be of great significance when framing strategies for asset allocation, portfolio performance and risk hedging. |
Keywords: | Russia-Ukraine war, economic sanctions, G7 debt, fixed income securities, quantile approaches |
Date: | 2023–01 |
URL: | http://d.repec.org/n?u=RePEc:agd:wpaper:23/057&r=big |
By: | Jonas Hanetho |
Abstract: | Algorithmic trading has gained attention due to its potential for generating superior returns. This paper investigates the effectiveness of deep reinforcement learning (DRL) methods in algorithmic commodities trading. It formulates the commodities trading problem as a continuous, discrete-time stochastic dynamical system. The proposed system employs a novel time-discretization scheme that adapts to market volatility, enhancing the statistical properties of subsampled financial time series. To optimize transaction-cost- and risk-sensitive trading agents, two policy gradient algorithms, namely actor-based and actor-critic-based approaches, are introduced. These agents utilize CNNs and LSTMs as parametric function approximators to map historical price observations to market positions. Backtesting on front-month natural gas futures demonstrates that DRL models increase the Sharpe ratio by 83% compared to the buy-and-hold baseline. Additionally, the risk profile of the agents can be customized through a hyperparameter that regulates risk sensitivity in the reward function during the optimization process. The actor-based models outperform the actor-critic-based models, while the CNN-based models show a slight performance advantage over the LSTM-based models. |
Date: | 2023–08 |
URL: | http://d.repec.org/n?u=RePEc:arx:papers:2309.00630&r=big |
By: | Stienen, Valentijn (Tilburg University, Center For Economic Research); den Hertog, Dick (Tilburg University, Center For Economic Research); Wagenaar, Joris (Tilburg University, Center For Economic Research); Zegher, J.F. |
Keywords: | Traffic speed; Road attribute prediction; (Convolutional) neural network; Satellite imagery; Weather information |
Date: | 2023 |
URL: | http://d.repec.org/n?u=RePEc:tiu:tiucen:de5c3c6d-44ee-45cf-b207-c748ae918bd5&r=big |
By: | Yi Yang; Yixuan Tang; Kar Yan Tam |
Abstract: | We present a new financial domain large language model, InvestLM, tuned on LLaMA-65B (Touvron et al., 2023), using a carefully curated instruction dataset related to financial investment. Inspired by less-is-more-for-alignment (Zhou et al., 2023), we manually curate a small yet diverse instruction dataset, covering a wide range of finance-related topics, from Chartered Financial Analyst (CFA) exam questions to SEC filings to Stackexchange quantitative finance discussions. InvestLM shows strong capabilities in understanding financial text and provides helpful responses to investment-related questions. Financial experts, including hedge fund managers and research analysts, rate InvestLM's responses as comparable to those of state-of-the-art commercial models (GPT-3.5, GPT-4 and Claude-2). Zero-shot evaluation on a set of financial NLP benchmarks demonstrates strong generalizability. From a research perspective, this work suggests that a high-quality domain-specific LLM can be tuned using a small set of carefully curated instructions on a well-trained foundation model, which is consistent with the Superficial Alignment Hypothesis (Zhou et al., 2023). From a practical perspective, this work develops a state-of-the-art financial domain LLM with superior capability in understanding financial texts and providing helpful investment advice, potentially enhancing the work efficiency of financial professionals. We release the model parameters to the research community. |
Date: | 2023–09 |
URL: | http://d.repec.org/n?u=RePEc:arx:papers:2309.13064&r=big |
By: | Bauer, Kevin; Liebich, Lena; Hinz, Oliver; Kosfeld, Michael |
Abstract: | In current discussions on large language models (LLMs) such as GPT, understanding their ability to emulate facets of human intelligence is central. Using behavioral economic paradigms and structural models, we investigate GPT's cooperativeness in human interactions and assess its rational goal-oriented behavior. We discover that GPT cooperates more than humans and has overly optimistic expectations about human cooperation. Intriguingly, additional analyses reveal that GPT's behavior isn't random; it displays a level of goal-oriented rationality surpassing human counterparts. Our findings suggest that GPT hyper-rationally aims to maximize social welfare, coupled with a drive for self-preservation. Methodologically, our research highlights how structural models, typically employed to decipher human behavior, can illuminate the rationality and goal-orientation of LLMs. This opens a compelling path for future research into the intricate rationality of sophisticated, yet enigmatic artificial agents. |
Keywords: | large language models, cooperation, goal orientation, economic rationality |
Date: | 2023 |
URL: | http://d.repec.org/n?u=RePEc:zbw:safewp:401&r=big |
By: | Van Dijcke, David (Department of Economics, University of Michigan); Buckmann, Marcus (Bank of England); Turrell, Arthur (Data Science Campus, Office for National Statistics); Key, Tomas (Bank of England) |
Abstract: | We assess how balance sheets propagated labour demand shocks during Covid-19 using novel matched data on firms and online job postings. Exploiting regional and firm-level variation in three pandemic policies in the UK, we find that financially healthy firms increased vacancies more in response to positive shocks. Less-leveraged firms and firms with higher credit scores increased postings more in response to the Eat Out to Help Out’s local demand subsidies and after receiving a Bounce Back Loan Scheme loan, respectively. These findings complement the link between leverage and employment losses in response to negative shocks. |
Keywords: | Covid-19; recession; vacancies; Indeed; job postings; job ads; heterogeneity; firm; firm-level; balance sheets; industry; big data; alternative data; labour market; natural language processing |
JEL: | C50 D20 G30 H10 J20 J60 |
Date: | 2023–07–21 |
URL: | http://d.repec.org/n?u=RePEc:boe:boeewp:1033&r=big |
By: | David M. Arseneau (Federal Reserve Board); Mitsuhiro Osada (Bank of Japan) |
Abstract: | We compare alternative methodologies for identifying central bank speeches that focus on climate change and argue that a supervised word-scoring method produces the most comprehensive set. Using these climate-related speeches, we empirically examine the role of the mandate in shaping central bank communication about climate change. Central banks differ considerably in the extent to which their mandates support a sustainability objective -- it can be explicit; it can be indirect, whereby the central bank is mandated to support broader government policies; or it may not be supported at all. Our results show that these differences are important in determining the frequency of climate-related communication as well as the context in which central banks address climate-related issues. All told, these findings suggest that mandate considerations play an important role in shaping central bank communication about climate change. |
Keywords: | Central bank speeches; Mandates; Climate change; Natural language processing |
JEL: | E58 E61 Q54 |
Date: | 2023–09–29 |
URL: | http://d.repec.org/n?u=RePEc:boj:bojwps:wp23e14&r=big |
By: | Boken, Johannes (University of Warwick); Draca, Mirko (University of Warwick); Mastrorocco, Nicola (University of Warwick); Ornaghi, Arianna (Hertie School) |
Abstract: | Social media has changed the structure of mass communication. In this paper we explore its role in influencing political donations. Using a daily dataset of campaign contributions and Twitter activity for US Members of Congress in 2019-2020, we find that attention on Twitter (as measured by likes) is positively correlated with the amount of daily small donations received. However, this is not true for everybody: the impact on campaign donations is highly skewed, indicating very concentrated returns to attention that are in line with a ‘winner-takes-all’ market. Our results are confirmed in a geography-based causal design linking members’ donations across states. |
Keywords: | Social Media; Twitter; Campaign Contributions |
JEL: | D72 P00 |
Date: | 2023 |
URL: | http://d.repec.org/n?u=RePEc:wrk:warwec:1472&r=big |