on Big Data |
By: | Yung, Vincent (Northwestern University); Colyvas, Jeannette |
Abstract: | Data wrangling is typically treated as an obligatory, codified, and ideally automated step in the machine learning (ML) pipeline. In contrast, we suggest that archival data wrangling is a theory-driven process best understood as a practical craft. Drawing on empirical examples from contemporary computational social science, we identify nine core modes of data wrangling, which can be seen as a sequence but are iterative and nonlinear in practice. Moreover, we discuss how data wrangling can address issues of algorithmic bias. While ML has shifted the focus towards architectural engineering, we assert that to properly engage with machine learning is to properly engage with data wrangling. |
Date: | 2023–08–18 |
URL: | http://d.repec.org/n?u=RePEc:osf:socarx:2dve6&r=big |
By: | Jiří Witzany; Milan Fičura |
Abstract: | The main goal of this paper is to compare the classical MCMC estimation method with a universal Neural Network (NN) approach to estimate unknown parameters of the Heston stochastic volatility model given a series of observable asset returns. The main idea of the NN approach is to generate a large synthetic training dataset with sampled parameter vectors and the return series conditional on the Heston model. The NN can then be trained by reversing the input and output, i.e. setting the return series, or rather a set of derived generalized moments, as the input features and the parameters as the target. Once the NN has been trained, the estimation of parameters given observed return series becomes very efficient compared to the MCMC algorithm. Our empirical study implements the MCMC estimation algorithm and demonstrates that the trained NN provides more precise and substantially faster estimations of the Heston model parameters. We discuss some other advantages and disadvantages of the two methods, and hypothesize that the universal NN approach can in general give better results compared to the classical statistical estimation methods for a wide class of models. |
Keywords: | Heston model, parameter estimation, neural networks, MCMC |
JEL: | C45 C63 G13 |
Date: | 2023–07–11 |
URL: | http://d.repec.org/n?u=RePEc:prg:jnlwps:v:5:y:2023:id:5.007&r=big |
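The data-generation step described in the abstract above (sample parameter vectors, simulate returns, then use moments as features and parameters as targets) can be sketched roughly as follows. This is an illustration only, not the authors' code: the Euler full-truncation scheme, the parameter ranges, the particular five moments, and the function names `simulate_heston_returns`, `moment_features` and `make_training_set` are all assumptions introduced here.

```python
import numpy as np


def simulate_heston_returns(mu, kappa, theta, sigma, rho, n_steps, rng, dt=1.0 / 252):
    """Simulate daily log-returns under a Heston model (Euler, full truncation)."""
    v = theta  # assumption: start the variance at its long-run mean
    returns = np.empty(n_steps)
    for t in range(n_steps):
        z1 = rng.standard_normal()
        # Correlate the variance shock with the return shock.
        z2 = rho * z1 + np.sqrt(1.0 - rho**2) * rng.standard_normal()
        v_pos = max(v, 0.0)  # truncate negative variance at zero
        returns[t] = (mu - 0.5 * v_pos) * dt + np.sqrt(v_pos * dt) * z1
        v = v + kappa * (theta - v_pos) * dt + sigma * np.sqrt(v_pos * dt) * z2
    return returns


def moment_features(r):
    """Hypothetical feature set: mean, std, skewness, kurtosis, lag-1 ACF of squared returns."""
    c = r - r.mean()
    s = c.std()
    skew = np.mean(c**3) / s**3
    kurt = np.mean(c**4) / s**4
    sq = r**2 - np.mean(r**2)
    acf1 = np.sum(sq[1:] * sq[:-1]) / np.sum(sq**2)
    return np.array([r.mean(), s, skew, kurt, acf1])


def make_training_set(n_samples, n_steps, seed=0):
    """Return (X, y): moment features as inputs, Heston parameters as targets."""
    rng = np.random.default_rng(seed)
    X = np.empty((n_samples, 5))
    y = np.empty((n_samples, 5))
    for i in range(n_samples):
        # Illustrative sampling ranges, not taken from the paper.
        params = np.array([
            rng.uniform(-0.1, 0.2),   # mu: drift
            rng.uniform(0.5, 5.0),    # kappa: mean-reversion speed
            rng.uniform(0.01, 0.09),  # theta: long-run variance
            rng.uniform(0.1, 0.6),    # sigma: volatility of variance
            rng.uniform(-0.9, 0.0),   # rho: return/variance correlation
        ])
        r = simulate_heston_returns(*params, n_steps=n_steps, rng=rng)
        X[i] = moment_features(r)
        y[i] = params
    return X, y
```

Any regression model could then be fitted on `(X, y)`; once trained, estimating parameters for an observed series reduces to computing its moments and running one forward pass, which is where the speed advantage over MCMC would come from.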
By: | Dimitrios Vamvourellis; Máté Toth; Snigdha Bhagat; Dhruv Desai; Dhagash Mehta; Stefano Pasquali |
Abstract: | Identifying companies with similar profiles is a core task in finance with a wide range of applications in portfolio construction, asset pricing and risk attribution. When a rigorous definition of similarity is lacking, financial analysts usually resort to 'traditional' industry classifications such as the Global Industry Classification Standard (GICS), which assign a unique category to each company at different levels of granularity. Due to their discrete nature, though, GICS classifications do not allow for ranking companies in terms of similarity. In this paper, we explore the ability of pre-trained and fine-tuned large language models (LLMs) to learn company embeddings based on the business descriptions reported in SEC filings. We show that we can reproduce GICS classifications using the embeddings as features. We also benchmark these embeddings on various machine learning and financial metrics and conclude that the companies that are similar according to the embeddings are also similar in terms of financial performance metrics including return correlation. |
Date: | 2023–08 |
URL: | http://d.repec.org/n?u=RePEc:arx:papers:2308.08031&r=big |
By: | Rama K. Malladi |
Abstract: | This study evaluated three Artificial Intelligence (AI) large language model (LLM) enabled platforms - ChatGPT, Bard, and Bing AI - to answer an undergraduate finance exam with 20 quantitative questions across various difficulty levels. ChatGPT scored 30 percent, outperforming Bing AI, which scored 20 percent, while Bard lagged behind with a score of 15 percent. These models faced common challenges, such as inaccurate computations and formula selection. While they are currently insufficient for helping students pass the finance exam, they serve as valuable tools for dedicated learners. Future advancements are expected to overcome these limitations, allowing for improved formula selection and accurate computations and potentially enabling students to score 90 percent or higher. |
Date: | 2023–08 |
URL: | http://d.repec.org/n?u=RePEc:arx:papers:2308.07979&r=big |
By: | Yael Hochberg; Ali Kakhbod; Peiyao Li; Kunal Sachdeva |
Abstract: | Implementing a state-of-the-art machine learning technique for causal identification from text data (C-TEXT), we document that patents authored by female inventors are under-cited relative to those authored by males. Relative to what the same patent would be predicted to receive had the lead inventor instead been male, patents with a female lead inventor receive 10% fewer citations. Patents with male lead inventors tend to undercite past patents with female lead inventors, while patent examiners of both genders appear to be more even-handed in the citations they add to patent applications. For female inventors, market-based measures of patent value load significantly on the citation counts that would be predicted by C-TEXT, but do not load significantly on actual forward citations. The under-recognition of female-authored patents likely has implications for the allocation of talent in the economy. |
JEL: | C13 J16 J24 J71 O30 |
Date: | 2023–08 |
URL: | http://d.repec.org/n?u=RePEc:nbr:nberwo:31592&r=big |
By: | Annabelle Mourougane; Polina Knutsson; Rodrigo Pazos; Julia Schmidt; Francesco Palermo |
Abstract: | Trade in value added (TiVA) indicators are increasingly used to monitor countries’ integration into global supply chains. However, they are published with a significant lag - often two or three years - which reduces their relevance for monitoring recent economic developments. This paper aims to provide more timely insights into the international fragmentation of production by exploring new ways of nowcasting five TiVA indicators for the years 2021 and 2022 covering a panel of 41 economies at the economy-wide level and for 24 industry sectors. The analysis relies on a range of models, including gradient boosted trees (GBM) and other machine-learning techniques, in a panel setting. It uses a wide range of explanatory variables capturing domestic business cycles and global economic developments, and corrects for publication lags to produce nowcasts in quasi-real time conditions. The resulting nowcasting algorithms significantly improve on the benchmark model and exhibit relatively low prediction errors at a one- and two-year horizon, although model performance varies across countries and sectors. |
Keywords: | Global value chains, Machine learning, Nowcasting |
JEL: | C4 C53 F17 |
Date: | 2023–09–06 |
URL: | http://d.repec.org/n?u=RePEc:oec:stdaaa:2023/03-en&r=big |
By: | Ilias Filippou; James Mitchell; My T. Nguyen |
Abstract: | Using close to 40 years of textual data from FOMC transcripts and the Federal Reserve staff's Greenbook/Tealbook, we extend Romer and Romer (2008) to test if the FOMC adds information relative to its staff forecasts not via its own quantitative forecasts but via its words. We use methods from natural language processing to extract from both types of document text-based forecasts that capture attentiveness to and sentiment about the macroeconomy. We test whether these text-based forecasts provide value-added in explaining the distribution of outcomes for GDP growth, the unemployment rate, and inflation. We find that FOMC tales about macroeconomic risks do add value in the tails, especially for GDP growth and the unemployment rate. For inflation, we find value-added in both FOMC point forecasts and narrative, once we extract from the text a broader set of measures of macroeconomic sentiment and risk attentiveness. |
Keywords: | monetary policy; sentiment; uncertainty; risk; forecast evaluation; FOMC meetings; textual analysis; machine learning; quantile regression |
JEL: | F31 G11 G15 |
Date: | 2023–08–30 |
URL: | http://d.repec.org/n?u=RePEc:fip:fedcwq:96636&r=big |
By: | Kleyton da Costa |
Abstract: | Anomaly detection is a challenging task, particularly in systems with many variables. Anomalies are outliers that statistically differ from the analyzed data and can arise from rare events, malfunctions, or system misuse. This study investigated the ability to detect anomalies in global financial markets through Graph Neural Networks (GNN) considering an uncertainty scenario measured by a nonextensive entropy. The main findings show that the complex structure of highly correlated assets decreases in a crisis, and the number of anomalies is statistically different for nonextensive entropy parameters when comparing the periods before, during, and after a crisis. |
Date: | 2023–08 |
URL: | http://d.repec.org/n?u=RePEc:arx:papers:2308.02914&r=big |
By: | Turan G. Bali; Bryan T. Kelly; Mathis Mörke; Jamil Rahman |
Abstract: | We propose a statistical model of differences in beliefs in which heterogeneous investors are represented as different machine learning model specifications. Each investor forms return forecasts from their own specific model using data inputs that are available to all investors. We measure disagreement as dispersion in forecasts across investor-models. Our measure aligns with extant measures of disagreement (e.g., analyst forecast dispersion), but is a significantly stronger predictor of future returns. We document a large, significant, and highly robust negative cross-sectional relation between belief disagreement and future returns. A decile spread portfolio that is short stocks with high forecast disagreement and long stocks with low disagreement earns a value-weighted alpha of 15% per year. A range of analyses suggest the alpha is mispricing induced by short-sale costs and limits-to-arbitrage. |
JEL: | C15 C4 C45 C58 G1 G10 G17 G4 G40 |
Date: | 2023–08 |
URL: | http://d.repec.org/n?u=RePEc:nbr:nberwo:31583&r=big |
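The disagreement measure in the abstract above, dispersion in forecasts across investor-models, can be illustrated with a small sketch. It assumes dispersion is taken as the cross-model standard deviation, and it uses equal weights within each leg for simplicity, whereas the paper reports a value-weighted portfolio. The function names `forecast_disagreement` and `decile_spread_weights` are hypothetical.

```python
import numpy as np


def forecast_disagreement(forecasts):
    """Cross-model standard deviation of return forecasts, one value per stock.

    forecasts: array-like of shape (n_models, n_stocks).
    """
    forecasts = np.asarray(forecasts, dtype=float)
    return forecasts.std(axis=0, ddof=1)


def decile_spread_weights(disagreement):
    """Long the lowest-disagreement decile, short the highest (equal-weighted legs)."""
    d = np.asarray(disagreement, dtype=float)
    lo, hi = np.quantile(d, [0.1, 0.9])
    long_leg = d <= lo
    short_leg = d >= hi
    w = np.zeros_like(d)
    w[long_leg] = 1.0 / long_leg.sum()     # long leg sums to +1
    w[short_leg] = -1.0 / short_leg.sum()  # short leg sums to -1
    return w
```

The resulting weight vector is dollar-neutral: multiplying it by next-period returns gives the spread return of the low-minus-high disagreement portfolio.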
By: | Mikrajuddin Abdullah |
Abstract: | The large number of zero trade flows continues to pose a challenge in identifying the parameters of the gravity model of international trade. Linear regression on the log-linearized equation breaks down because the logarithm of a zero trade flow is undefined. Although several approaches to solving this problem have been proposed, the majority of them are no longer based on linear regression, making the process of finding solutions more complex. In this work, we suggest a two-step technique for determining the gravity parameters: first, perform linear regression locally to establish a dummy value to substitute for zero trade flows, and then estimate the gravity parameters. Iterative techniques are used to determine the optimum parameters. Machine learning is used to test the estimated parameters by analyzing their position in the cluster. We calculated international trade figures for 2004, 2009, 2014, and 2019. We examine only the classic gravity equation and find that the powers of GDP and distance are in the same cluster and are both roughly equal to one. The strategy presented here can be used to solve other problems involving log-linear regression. |
Date: | 2023–08 |
URL: | http://d.repec.org/n?u=RePEc:arx:papers:2308.06303&r=big |
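To make the zero-flow problem in the abstract above concrete, the sketch below fits the classic log-linear gravity equation, ln T = ln G + a ln Y_i + b ln Y_j - c ln D, by ordinary least squares. It is only the naive baseline: zero flows are dropped because their logarithm is undefined, which is the step the paper replaces with a locally estimated dummy value. That two-step procedure is not reproduced here, and the function name `fit_log_gravity` is hypothetical.

```python
import numpy as np


def fit_log_gravity(trade, gdp_i, gdp_j, dist):
    """OLS fit of the log-linear gravity equation on strictly positive flows.

    Returns [ln G, a, b, -c], where a and b are the GDP powers and c is the
    distance power. Zero flows are dropped (a simplification, not the
    paper's method).
    """
    trade = np.asarray(trade, dtype=float)
    gdp_i = np.asarray(gdp_i, dtype=float)
    gdp_j = np.asarray(gdp_j, dtype=float)
    dist = np.asarray(dist, dtype=float)

    positive = trade > 0  # log(0) is undefined, so these rows cannot be used
    X = np.column_stack([
        np.ones(positive.sum()),
        np.log(gdp_i[positive]),
        np.log(gdp_j[positive]),
        np.log(dist[positive]),
    ])
    y = np.log(trade[positive])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef
```

On data generated exactly as T = G · Y_i · Y_j / D, the fit returns powers of 1, 1 and -1. With real data, dropping the zero flows discards information and can bias the estimates, which is the motivation for substituting a value rather than deleting the observation.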
By: | Sander van Cranenburgh; Francisco Garrido-Valenzuela |
Abstract: | Visual imagery is indispensable to many multi-attribute decision situations. Examples of such decision situations in travel behaviour research include residential location choices, vehicle choices, tourist destination choices, and various safety-related choices. However, current discrete choice models cannot handle image data and thus cannot incorporate information embedded in images into their representations of choice behaviour. This gap between discrete choice models' capabilities and the real-world behaviour they seek to model leads to incomplete and, possibly, misleading outcomes. To close this gap, this study proposes "Computer Vision-enriched Discrete Choice Models" (CV-DCMs). CV-DCMs can handle choice tasks involving numeric attributes and images by integrating computer vision and traditional discrete choice models. Moreover, because CV-DCMs are grounded in random utility maximisation principles, they maintain the solid behavioural foundation of traditional discrete choice models. We demonstrate the proposed CV-DCM by applying it to data obtained through a novel stated choice experiment involving residential location choices. In this experiment, respondents faced choice tasks with trade-offs between commute time, monthly housing cost and street-level conditions, presented using images. As such, this research contributes to the growing body of literature in the travel behaviour field that seeks to integrate discrete choice modelling and machine learning. |
Date: | 2023–08 |
URL: | http://d.repec.org/n?u=RePEc:arx:papers:2308.08276&r=big |
By: | Valentina Semenova; Dragos Gorduza; William Wildi; Xiaowen Dong; Stefan Zohren |
Abstract: | A trite yet fundamental question in economics is: What causes large asset price fluctuations? A tenfold rise in the price of GameStop equity, between the 22nd and 28th of January 2021, demonstrated that herding behaviour among retail investors is an important contributing factor. This paper presents a data-driven guide to the forum that started the hype: WallStreetBets (WSB). Our initial experiments decompose the forum using a large language topic model and network tools. The topic model describes the evolution of the forum over time and shows the persistence of certain topics (such as the market / S&P500 discussion), and the sporadic interest in others, such as COVID or crude oil. Network analysis allows us to decompose the landscape of retail investors into clusters based on their posting and discussion habits; several large, correlated asset discussion clusters emerge, surrounded by smaller, niche ones. A second set of experiments assesses the impact that WSB discussions have had on the market. We show that forum activity has a Granger-causal relationship with the returns of several assets, some of which are now commonly classified as 'meme stocks', while others have gone under the radar. The paper extracts a set of short-term trade signals from posts and long-term (monthly and weekly) trade signals from forum dynamics, and considers their predictive power at different time horizons. In addition to the analysis, the paper presents the dataset, as well as an interactive dashboard, in order to promote further research. |
Date: | 2023–08 |
URL: | http://d.repec.org/n?u=RePEc:arx:papers:2308.09485&r=big |
By: | Bonan, Jacopo; Curzi, Daniele; D'Adda, Giovanna; Ferro, Simone |
Abstract: | We employ electricity-use data covering 1,500,000 Italian households for 2015–2019 and a granular measure of social media attention to climate change derived from universal-coverage Twitter data to show that increases in climate change salience induced by exogenous sociopolitical and climatic events cause a significant reduction in energy consumption. Sentiment analysis suggests that natural disasters and climate strikes are associated with emotions that are strong motivators for action. These results imply that episodes that draw attention to climate change may lead to actual behavioral change, but their effect is short-lived. |
Date: | 2023–08–24 |
URL: | http://d.repec.org/n?u=RePEc:rff:dpaper:dp-23-34&r=big |
By: | Daouia, Abdelaati; Abbas, Yasser |
Abstract: | psychology, and political and media literature over the last 20 years. Most of these offerings focus on specific qualities or outcomes related to their textual data, which limits their applicability and scope. Instead, we use novel datasets that adopt a more holistic approach to data gathering and text mining, allowing texts to speak for themselves without shackling them with presupposed goals or biases. Our data consists of networks of nodes representing key performance indicators of companies, industries, countries, and events. These nodes are linked by edges weighted by the number of times the concepts were connected in media articles between January 2018 and January 2022. We study these networks through the lens of graph theory and use modularity-based clustering, in the form of the Leiden algorithm, to group nodes into information-filled communities. We showcase the potential of such data by exploring the evolution of our dynamic networks and their metrics over time, which highlights their ability to tell coherent and concise stories about the world economy. |
Keywords: | Dynamic clustering; graph theory metrics; influential economic actors; written media analysis; R, Gephi |
Date: | 2023–08–29 |
URL: | http://d.repec.org/n?u=RePEc:tse:wpaper:128419&r=big |