By: | Francisco de Arriba-Pérez; Silvia García-Méndez; José A. Regueiro-Janeiro; Francisco J. González-Castaño
Abstract: | Micro-blogging sources such as the Twitter social network provide valuable real-time data for market prediction models. Investors' opinions in this network follow the fluctuations of the stock markets and often include educated speculations on market opportunities that may have an impact on the actions of other investors. In view of this, we propose a novel system to detect positive predictions in tweets, a type of financial emotion which we term "opportunity" and which is akin to "anticipation" in Plutchik's theory. Specifically, we seek high detection precision, so as to present a financial operator with a substantial number of such tweets while differentiating them from the other financial emotions in our system. We achieve this with a three-layer stacked Machine Learning classification system with sophisticated features that result from applying Natural Language Processing techniques to extract valuable linguistic information. Experimental results on a dataset that has been manually annotated with financial emotion and ticker occurrence tags demonstrate that our system yields satisfactory and competitive performance in financial opportunity detection, with precision values up to 83%. This promising outcome endorses the usability of our system to support investors' decision making.
Date: | 2024–03 |
URL: | http://d.repec.org/n?u=RePEc:arx:papers:2404.07224&r=big |
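The abstract above describes a stacked classification system over NLP-derived features. A minimal sketch of that idea, assuming a scikit-learn pipeline with TF-IDF features and generic base learners (not the authors' three-layer architecture or feature set):

```python
# Illustrative sketch (not the authors' code): a stacked classifier over
# TF-IDF features for flagging "opportunity" tweets vs. other emotions.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

base_learners = [
    ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
    ("svm", LinearSVC(C=1.0)),
]
# A logistic regression stacks the base predictions; precision can then be
# tuned via the decision threshold to favour high-precision alerts.
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression(max_iter=1000))
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=2), stack)

# model.fit(train_tweets, train_labels)              # hypothetical data
# proba = model.predict_proba(test_tweets)[:, 1]     # opportunity score
```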
By: | Lee, Wang-Sheng (Monash University); Tran, Trang My (Monash University) |
Abstract: | Environmental research related to military activities and warfare is sparse and fragmented by discipline. Although achieving military objectives will likely continue to trump any concerns related to the environment during active conflict, military training during peacetime has environmental consequences. This research aims to quantify how much pollution is emitted during regular military exercises, which has implications for climate change. Focusing on major military training exercises conducted in Australia, we assess the impact of four international exercises held within a dedicated military training area on pollution levels. Leveraging high-frequency data, we employ a machine learning algorithm in conjunction with program evaluation techniques to estimate the effects of military training activities. Our main approach involves generating counterfactual predictions and utilizing a "prediction-error" framework to estimate treatment effects by comparing a treatment area to a control area. Our findings reveal that these exercises led to a notable increase in air pollution levels, potentially reaching up to 25% relative to mean levels during peak training hours.
Keywords: | machine learning, military emissions, military training, pollution |
JEL: | C55 Q53 Q54 |
Date: | 2024–03 |
URL: | http://d.repec.org/n?u=RePEc:iza:izadps:dp16889&r=big |
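The counterfactual "prediction-error" framework described above can be sketched as follows; the column names (pm25, exercise) and the gradient-boosting learner are assumptions for illustration, not the paper's specification:

```python
# A minimal sketch of the "prediction-error" idea: fit on non-exercise
# periods, predict the counterfactual during exercises, and read the mean
# prediction error as the treatment effect.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

def prediction_error_effect(df: pd.DataFrame, features: list[str]) -> float:
    """Mean of (observed - counterfactual) pollution during exercise hours."""
    train = df[df["exercise"] == 0]        # control/pre-exercise observations
    test = df[df["exercise"] == 1]         # exercise hours in the treated area
    model = GradientBoostingRegressor(random_state=0)
    model.fit(train[features], train["pm25"])
    counterfactual = model.predict(test[features])
    return float((test["pm25"] - counterfactual).mean())
```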
By: | Leogrande, Angelo |
Abstract: | The following article analyzes Italian companies with more than 10 employees that use online sales tools. The data used were acquired from the ISTAT-BES database. The article first presents a static analysis of the data aimed at framing the phenomenon in the context of Italian regional disparities. Subsequently, k-Means clustering is applied, with the number of clusters chosen by comparing the Silhouette coefficient and the Elbow method. The innovative and technological determinants of the observed variable are then investigated through a panel econometric model. Finally, different machine learning algorithms for prediction are compared. The results are critically discussed with economic policy suggestions.
Keywords: | Innovation, Innovation and Invention, Management of Technological Innovation and R&D, Technological Change, Intellectual Property and Intellectual Capital. |
JEL: | O30 O31 O32 O33 O34 |
Date: | 2024–04–05 |
URL: | http://d.repec.org/n?u=RePEc:pra:mprapa:120637&r=big |
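A brief sketch of the cluster-number selection step mentioned in the abstract, comparing the Elbow (inertia) and Silhouette criteria with scikit-learn; the candidate range of k is an arbitrary choice:

```python
# Sketch only: compare elbow and silhouette criteria over candidate k values.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

def compare_k(X: np.ndarray, k_range=range(2, 11)) -> dict:
    X = StandardScaler().fit_transform(X)
    results = {}
    for k in k_range:
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        results[k] = {"inertia": km.inertia_,                   # elbow criterion
                      "silhouette": silhouette_score(X, km.labels_)}
    return results
```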
By: | Pawanesh Yadav; Charu Sharma; Niteesh Sahni |
Abstract: | This paper presents an analysis of the Indian stock market based on embedding the stock network in a hyperbolic space using machine learning techniques. We claim novelty on four counts. First, it is demonstrated that the hyperbolic clusters resemble the topological network communities more closely than the Euclidean clusters. Second, we are able to clearly distinguish between periods of market stability and volatility through a statistical analysis of the hyperbolic distance and the hyperbolic shortest path distance corresponding to the embedded network. Third, we demonstrate that, using the modularity of the embedded network, significant market changes can be spotted early. Lastly, the coalescent embedding is able to segregate certain market sectors, thereby underscoring its natural clustering ability.
Date: | 2024–04 |
URL: | http://d.repec.org/n?u=RePEc:arx:papers:2404.04710&r=big |
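For reference, the hyperbolic distance whose statistics the abstract relies on can be computed in the Poincaré model as follows; this is the generic formula, not the authors' embedding code:

```python
# Sketch only: hyperbolic (Poincaré-ball) distance between two embedded
# nodes, the quantity whose distribution is tracked over time.
import numpy as np

def poincare_distance(u: np.ndarray, v: np.ndarray) -> float:
    """Geodesic distance between two points inside the unit ball."""
    diff = np.sum((u - v) ** 2)
    denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    return float(np.arccosh(1.0 + 2.0 * diff / denom))
```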
By: | Masanori Hirano |
Abstract: | Derivative hedging and pricing are important and continuously studied topics in financial markets. Recently, deep hedging has been proposed as a promising approach that uses deep learning to approximate the optimal hedging strategy and can handle incomplete markets. However, deep hedging usually requires underlying asset simulations, and it is challenging to select the best model for such simulations. This study proposes a new approach using artificial market simulations for underlying asset simulations in deep hedging. Artificial market simulations can replicate the stylized facts of financial markets, and they seem to be a promising approach for deep hedging. We investigate the effectiveness of the proposed approach by comparing its results with those of the traditional approach, which uses mathematical finance models such as Brownian motion and Heston models for underlying asset simulations. The results show that the proposed approach can achieve almost the same level of performance as the traditional approach without mathematical finance models. Finally, we also reveal that the proposed approach has some limitations in terms of performance under certain conditions. |
Date: | 2024–04 |
URL: | http://d.repec.org/n?u=RePEc:arx:papers:2404.09462&r=big |
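As a point of comparison with the paper's artificial-market simulator, the "traditional" approach it benchmarks against feeds paths from mathematical finance models to the deep hedger. A minimal geometric-Brownian-motion path generator of that kind (all parameters are placeholders):

```python
# Not the paper's simulator: GBM underlying-asset paths of the sort a
# traditional deep-hedging setup would train the hedging network on.
import numpy as np

def gbm_paths(s0=100.0, mu=0.0, sigma=0.2, n_steps=30, n_paths=10_000,
              dt=1.0 / 252, seed=0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n_paths, n_steps))
    increments = (mu - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * z
    return s0 * np.exp(np.cumsum(increments, axis=1))   # (n_paths, n_steps)
```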
By: | Thiago C. Silva; Paulo V. B. Wilhelm; Diego R. Amancio |
Abstract: | This study examines the effects of de-globalization trends on international trade networks and their role in improving forecasts for economic growth. Using section-level trade data from nearly 200 countries from 2010 to 2022, we identify significant shifts in the network topology driven by rising trade policy uncertainty. Our analysis highlights key global players through centrality rankings, with the United States, China, and Germany maintaining consistent dominance. Using a horse race of supervised regressors, we find that network topology descriptors evaluated from section-specific trade networks substantially enhance the quality of a country's GDP growth forecast. We also find that non-linear models, such as Random Forest, XGBoost, and LightGBM, outperform traditional linear models used in the economics literature. Using SHAP values to interpret these non-linear models' predictions, we find that about half of the most important features originate from the network descriptors, underscoring their vital role in refining forecasts. Moreover, this study emphasizes the significance of recent economic performance, population growth, and the primary sector's influence in shaping economic growth predictions, offering novel insights into the intricacies of economic growth forecasting.
Date: | 2024–04 |
URL: | http://d.repec.org/n?u=RePEc:arx:papers:2404.08712&r=big |
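A rough sketch of the pipeline described above, combining network-topology descriptors with a tree-based regressor and SHAP attributions; the chosen centrality measures and the Random Forest learner are illustrative assumptions, not the authors' exact horse-race setup:

```python
# Sketch, not the authors' pipeline: derive topology features from a weighted
# trade graph and attribute a growth forecast to features via SHAP.
import networkx as nx
import pandas as pd
import shap
from sklearn.ensemble import RandomForestRegressor

def network_features(G: nx.DiGraph) -> pd.DataFrame:
    """A few standard descriptors per country node."""
    return pd.DataFrame({
        "in_degree": nx.in_degree_centrality(G),
        "out_degree": nx.out_degree_centrality(G),
        "pagerank": nx.pagerank(G, weight="weight"),
    })

def shap_importance(X: pd.DataFrame, y: pd.Series) -> pd.DataFrame:
    """Fit a tree-based regressor and return per-observation SHAP values."""
    model = RandomForestRegressor(random_state=0).fit(X, y)
    shap_values = shap.TreeExplainer(model).shap_values(X)
    return pd.DataFrame(shap_values, columns=X.columns, index=X.index)
```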
By: | Beck, Günter W.; Carstensen, Kai; Menz, Jan-Oliver; Schnorrenberger, Richard; Wieland, Elisabeth |
Abstract: | We study how millions of granular, weekly household scanner records combined with machine learning can help improve the real-time nowcast of German inflation. Our nowcasting exercise targets three hierarchy levels of inflation: individual products, product groups, and headline inflation. At the individual product level, we construct a large set of weekly scanner-based price indices that closely match their official counterparts, such as butter and coffee beans. Within a mixed-frequency setup, these indices significantly improve inflation nowcasts already after the first seven days of a month. For nowcasting product groups such as processed and unprocessed food, we apply shrinkage estimators to exploit the large set of scanner-based price indices, resulting in substantial predictive gains over autoregressive time series models. Finally, by adding high-frequency information on energy and travel services, we construct competitive nowcasting models for headline inflation that are on par with, or even outperform, survey-based inflation expectations.
Keywords: | inflation nowcasting, machine learning methods, mixed-frequency modeling, scanner price data
JEL: | E31 C55 E37 C53
Date: | 2024–04 |
URL: | http://d.repec.org/n?u=RePEc:ecb:ecbwps:20242930&r=big |
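The shrinkage step for product-group nowcasts could look roughly like the following Lasso sketch; the data layout (weekly scanner indices as columns, monthly inflation as target) is an assumption for illustration:

```python
# Minimal sketch of a shrinkage nowcast: Lasso on a wide panel of weekly
# scanner-based price indices predicting monthly product-group inflation.
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

nowcaster = make_pipeline(StandardScaler(),
                          LassoCV(cv=5, max_iter=50_000))
# X: weekly scanner indices observed in the first 7/14/... days of month t;
# y: official month-on-month inflation for the product group (assumed data).
# nowcaster.fit(X_train, y_train)
# nowcast = nowcaster.predict(X_latest)
```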
By: | Leogrande, Angelo |
Abstract: | In the following article, I analyze the trend of cultural and creative employment in the Italian regions between 2004 and 2022 using ISTAT-BES data. After presenting a static analysis, I present the results of a clustering analysis aimed at identifying groupings among the Italian regions. Subsequently, an econometric model is proposed for estimating the value of cultural and creative employment in the Italian regions. Finally, I compare various machine learning models for predicting the value of cultural and creative employment. The results are critically discussed through an economic policy analysis.
Date: | 2024–04–01 |
URL: | http://d.repec.org/n?u=RePEc:osf:socarx:h5nq4&r=big |
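The model-comparison step mentioned in the abstract can be sketched as a cross-validated horse race between a linear benchmark and tree-based learners; the specific models and error metric are assumptions, not necessarily those used in the paper:

```python
# Sketch of a model-comparison step: cross-validated errors for several
# regressors predicting regional cultural and creative employment.
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

models = {
    "ols": LinearRegression(),
    "random_forest": RandomForestRegressor(random_state=0),
    "boosting": GradientBoostingRegressor(random_state=0),
}
# X, y: regional panel of predictors and the target variable (assumed data).
# scores = {name: -cross_val_score(m, X, y, cv=5,
#                                  scoring="neg_mean_absolute_error").mean()
#           for name, m in models.items()}
```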
By: | Xu, Yonghong; Su, Bingjie; Pan, Wenjie; Zhou, Peng (Cardiff Business School) |
Abstract: | We propose a high-frequency digital economy index by combining official white papers and big data. It aims to resolve the discrepancy between the new economic reality and the old economic indicators used by decision-makers and policymakers. We demonstrate a significant effect of keyword rotation on the indices. Further analysis of the Dagum-Gini coefficient shows that the spatial heterogeneity and temporal variation of the digital economy indices can be mainly attributed to between-group inequality.
Keywords: | Digital Economy; High-Frequency Index; Big Data; Text Analysis; Hierarchical Dynamic Factor Model |
JEL: | O33 O53 C38 |
Date: | 2024–04 |
URL: | http://d.repec.org/n?u=RePEc:cdf:wpaper:2024/11&r=big |
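For the Dagum-Gini analysis mentioned above, the pairwise between-group component can be computed as in the following sketch; variable names are placeholders and this is not the authors' code:

```python
# Illustrative only: the pairwise (gross) between-group Gini G_jh used in the
# Dagum decomposition, here for two groups' digital-economy index values.
import numpy as np

def dagum_between_gini(y_j: np.ndarray, y_h: np.ndarray) -> float:
    """G_jh: mean absolute cross-group difference scaled by the sum of means."""
    diffs = np.abs(y_j[:, None] - y_h[None, :])
    return float(diffs.mean() / (y_j.mean() + y_h.mean()))
```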
By: | Diego Vallarino |
Abstract: | By integrating survival analysis, machine learning algorithms, and economic interpretation, this research examines the temporal dynamics associated with attaining a 5 percent rise in purchasing power parity-adjusted GDP per capita over a period of 120 months (2013-2022). A comparative investigation reveals that DeepSurv is proficient at capturing non-linear interactions, although standard models exhibit comparable performance under certain circumstances. The weight matrix evaluates the economic ramifications of vulnerabilities, risks, and capacities. To meet the GDP-per-capita objective, the findings emphasize the need for a balanced approach to risk-taking, strategic vulnerability reduction, and investment in governmental capacities and social cohesiveness. The policy guidelines promote individualized approaches that take the complex dynamics at play into account when making decisions.
Date: | 2024–04 |
URL: | http://d.repec.org/n?u=RePEc:arx:papers:2404.04282&r=big |
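A classical benchmark for the survival setup described above is a Cox proportional-hazards model; a minimal sketch with the lifelines package, under assumed column names:

```python
# Sketch of the classical benchmark: a Cox proportional-hazards fit where the
# "event" is attainment of the 5% GDP-per-capita gain. Column names are
# assumptions, not the paper's variable names.
import pandas as pd
from lifelines import CoxPHFitter

def fit_cox(df: pd.DataFrame) -> CoxPHFitter:
    """df columns (assumed): duration_months, reached_target (0/1), plus
    vulnerability, risk and capacity indicators as covariates."""
    cph = CoxPHFitter()
    cph.fit(df, duration_col="duration_months", event_col="reached_target")
    return cph
```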
By: | Masanori Hirano; Kentaro Imajo |
Abstract: | Large language models (LLMs) are now widely used in various fields, including finance. However, Japanese financial-specific LLMs have not been proposed yet. Hence, this study aims to construct a Japanese financial-specific LLM through continual pre-training. Before tuning, we constructed Japanese finance-focused datasets for continual pre-training. As a base model, we employed a Japanese LLM that achieved state-of-the-art performance on Japanese financial benchmarks among the 10-billion-class parameter models. After continual pre-training using the datasets and the base model, the tuned model performed better than the original model on the Japanese financial benchmarks. Moreover, a comparison of the outputs reveals that the tuned model's answers tend to be better than the original model's in terms of quality and length. These findings indicate that domain-specific continual pre-training is also effective for LLMs. The tuned model is publicly available on Hugging Face.
Date: | 2024–04 |
URL: | http://d.repec.org/n?u=RePEc:arx:papers:2404.10555&r=big |
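Continual pre-training of a causal language model on a domain corpus can be sketched with Hugging Face transformers as below; the base checkpoint, corpus file, and hyperparameters are placeholders, not the paper's:

```python
# Minimal continual pre-training sketch (causal LM objective) with the
# Hugging Face Trainer. Checkpoint and data file names are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "your-japanese-base-llm"                       # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

corpus = load_dataset("text", data_files={"train": "finance_corpus_ja.txt"})
tokenized = corpus.map(lambda b: tok(b["text"], truncation=True, max_length=1024),
                       batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ckpt", num_train_epochs=1,
                           per_device_train_batch_size=1),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
```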
By: | R. Michael Alvarez; Jacob Morrier |
Abstract: | This paper presents a new approach to evaluating the quality of answers in political question-and-answer sessions. We propose to measure an answer's quality based on the degree to which it allows us to infer the initial question accurately. This conception of answer quality inherently reflects an answer's relevance to the initial question. Drawing parallels with semantic search, we argue that this measurement approach can be operationalized by fine-tuning a large language model on the observed corpus of questions and answers without additional labeled data. We showcase our measurement approach within the context of the Question Period in the Canadian House of Commons. Our approach yields valuable insights into the correlates of the quality of answers in the Question Period. We find that answer quality varies significantly based on the party affiliation of the members of Parliament asking the questions and uncover a meaningful correlation between answer quality and the topics of the questions.
Date: | 2024–04 |
URL: | http://d.repec.org/n?u=RePEc:arx:papers:2404.08816&r=big |
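The idea that a good answer lets one recover its question can be approximated, in a retrieval style, with a generic sentence encoder; this is an illustration of the principle, not the authors' fine-tuned model:

```python
# Retrieval-style sketch: score an answer by how highly it ranks its true
# question among candidate questions, using a generic sentence encoder.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")   # generic encoder, illustrative

def answer_quality(answer: str, true_question: str, candidates: list[str]) -> float:
    """Reciprocal rank of the true question given the answer embedding."""
    a = model.encode(answer, convert_to_tensor=True)
    qs = model.encode([true_question] + candidates, convert_to_tensor=True)
    sims = util.cos_sim(a, qs)[0]                 # sims[0] = true question
    rank = int((sims > sims[0]).sum().item()) + 1
    return 1.0 / rank
```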
By: | Nisarg Patel; Harmit Shah; Kishan Mewada |
Abstract: | Navigating the intricate landscape of financial markets requires adept forecasting of stock price movements. This paper delves into the potential of Long Short-Term Memory (LSTM) networks for predicting stock dynamics, with a focus on discerning nuanced rise and fall patterns. Leveraging a dataset from the New York Stock Exchange (NYSE), the study incorporates multiple features to enhance LSTM's capacity in capturing complex patterns. Visualization of key attributes, such as opening, closing, low, and high prices, aids in unraveling subtle distinctions crucial for comprehensive market understanding. The meticulously crafted LSTM input structure, inspired by established guidelines, incorporates both price and volume attributes over a 25-day time step, enabling the model to capture temporal intricacies. A comprehensive methodology, including hyperparameter tuning with Grid Search, Early Stopping, and Callback mechanisms, leads to a remarkable 53% improvement in predictive accuracy. The study concludes with insights into model robustness, contributions to financial forecasting literature, and a roadmap for real-time stock market prediction. The amalgamation of LSTM networks, strategic hyperparameter tuning, and informed feature selection presents a potent framework for advancing the accuracy of stock price predictions, contributing substantively to financial time series forecasting discourse. |
Date: | 2023–12 |
URL: | http://d.repec.org/n?u=RePEc:arx:papers:2403.18822&r=big |
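A compact Keras sketch of the described setup, with a 25-day window over price and volume features; the layer sizes and training hyperparameters are placeholders rather than the tuned values reported in the paper:

```python
# Sketch only: an LSTM classifier over 25-day windows of price/volume
# features, with early stopping as in the described methodology.
from tensorflow import keras
from tensorflow.keras import layers

def build_lstm(n_features: int, window: int = 25) -> keras.Model:
    model = keras.Sequential([
        layers.Input(shape=(window, n_features)),
        layers.LSTM(64),
        layers.Dense(1, activation="sigmoid"),    # rise (1) vs. fall (0)
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

early_stop = keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)
# model = build_lstm(n_features=5)
# model.fit(X_train, y_train, validation_split=0.1, callbacks=[early_stop])
```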
By: | Jordan Vazquez Llana (LARSH - Laboratoire de Recherche Sociétés & Humanités - UPHF - Université Polytechnique Hauts-de-France - INSA Hauts-De-France - INSA Institut National des Sciences Appliquées Hauts-de-France - INSA - Institut National des Sciences Appliquées); Cécile Godé (CERGAM - Centre d'Études et de Recherche en Gestion d'Aix-Marseille - AMU - Aix Marseille Université - UTLN - Université de Toulon); Jean-Fabrice Lebraty (Laboratoire de Recherche Magellan - UJML - Université Jean Moulin - Lyon 3 - Université de Lyon - Institut d'Administration des Entreprises (IAE) - Lyon) |
Date: | 2022–06–30 |
URL: | http://d.repec.org/n?u=RePEc:hal:journl:hal-04525898&r=big |
By: | Daube, Carl Heinz; Krivenkov, Vladislav |
Abstract: | This working paper investigates the use of generative AI applications for predicting interest rate decisions by the Federal Reserve. It assesses whether these applications can be utilized to forecast changes in interest rates and compares their predictive accuracy with market expectations over a period of six months. |
Keywords: | artificial intelligence, Fed policy rate decisions, large language model (LLM), natural language processing (NLP), generative AI application ChatGPT, generative AI application Google Gemini
JEL: | G0 |
Date: | 2024 |
URL: | http://d.repec.org/n?u=RePEc:zbw:esprep:293992&r=big |
By: | Yi Chen; Hanming Fang; Yi Zhao; Zibo Zhao |
Abstract: | Categorical variables have no intrinsic ordering, and researchers often adopt a fixed-effect (FE) approach in empirical analysis. However, this approach has two significant limitations: it overlooks the textual labels associated with the categorical variables, and it produces unstable results when there are only limited observations in a category. In this paper, we propose a novel method that utilizes recent advances in large language models (LLMs) to recover the overlooked information in categorical variables. We apply this method to investigate labor market mismatch. Specifically, we task LLMs with simulating the role of a human resources specialist to assess the suitability of an applicant with specific characteristics for a given job. Our main findings can be summarized in three parts. First, using comprehensive administrative data from an online job posting platform, we show that our new match quality measure is positively correlated with several traditional measures in the literature, and at the same time, we highlight the LLM's capability to provide additional information conditional on the traditional measures. Second, we demonstrate the broad applicability of the new method with survey data containing significantly less information than the administrative data, which makes it impossible to compute most of the traditional match quality measures. Our LLM measure successfully replicates most of the salient patterns observed in a hard-to-access administrative dataset using easily accessible survey data. Third, we investigate the gender gap in match quality and explore whether gender stereotypes exist in the hiring process. We simulate an audit study, examining whether revealing gender information to LLMs influences their assessment. We show that when gender information is disclosed to the GPT model, it deems females better suited for traditionally female-dominated roles.
JEL: | C55 J16 J24 J31 |
Date: | 2024–04 |
URL: | http://d.repec.org/n?u=RePEc:nbr:nberwo:32327&r=big |
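The LLM-as-HR-specialist measurement described above boils down to a structured prompt; a hedged sketch using the OpenAI Python client, where the model name and scoring scale are assumptions:

```python
# Illustrative prompt only: ask an LLM to play an HR specialist and rate
# applicant-job suitability. Requires an API key in the environment.
from openai import OpenAI

client = OpenAI()

def match_quality(applicant_profile: str, job_posting: str) -> str:
    prompt = (
        "You are a human resources specialist. On a scale from 1 (poor fit) "
        "to 10 (excellent fit), rate how suitable this applicant is for the "
        "job below and answer with the number only.\n\n"
        f"Applicant:\n{applicant_profile}\n\nJob posting:\n{job_posting}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",                      # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```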
By: | Benjamin S. Manning; Kehang Zhu; John J. Horton |
Abstract: | We present an approach for automatically generating and testing, in silico, social scientific hypotheses. This automation is made possible by recent advances in large language models (LLMs), but the key feature of the approach is the use of structural causal models. Structural causal models provide a language to state hypotheses, a blueprint for constructing LLM-based agents, an experimental design, and a plan for data analysis. The fitted structural causal model becomes an object available for prediction or the planning of follow-on experiments. We demonstrate the approach with several scenarios: a negotiation, a bail hearing, a job interview, and an auction. In each case, causal relationships are both proposed and tested by the system, finding evidence for some and not others. We provide evidence that the insights from these simulations of social interactions are not available to the LLM purely through direct elicitation. When given its proposed structural causal model for each scenario, the LLM is good at predicting the signs of estimated effects, but it cannot reliably predict the magnitudes of those estimates. In the auction experiment, the in silico simulation results closely match the predictions of auction theory, but elicited predictions of the clearing prices from the LLM are inaccurate. However, the LLM's predictions are dramatically improved if the model can condition on the fitted structural causal model. In short, the LLM knows more than it can (immediately) tell.
Date: | 2024–04 |
URL: | http://d.repec.org/n?u=RePEc:arx:papers:2404.11794&r=big |
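Once agent interactions are simulated, the hypothesized structural causal model is fitted to the logged outcomes; a toy sketch for a single linear causal path (variable names are hypothetical):

```python
# Toy sketch of the SCM-fitting step: estimate one hypothesized linear causal
# path (e.g., reserve offer -> clearing price) from simulated agent outcomes.
import numpy as np
import statsmodels.api as sm

def fit_scm(offers: np.ndarray, clearing_prices: np.ndarray):
    """Return coefficient estimates and standard errors for the causal path."""
    X = sm.add_constant(offers)
    model = sm.OLS(clearing_prices, X).fit()
    return model.params, model.bse   # sign and magnitude of the estimated effect
```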