nep-big New Economics Papers
on Big Data
Issue of 2024‒10‒07
thirteen papers chosen by
Tom Coupé, University of Canterbury


  1. Patent Text and Long-Run Innovation Dynamics: The Critical Role of Model Selection By Ina Ganguli; Jeffrey Lin; Vitaly Meursault; Nicholas F. Reynolds
  2. Pricing American Options using Machine Learning Algorithms By Prudence Djagba; Callixte Ndizihiwe
  3. Performance analysis of optimization methods for machine learning By Abbaszadehpeivasti, Hadi
  4. Automate Strategy Finding with LLM in Quant investment By Zhizhuo Kou; Holam Yu; Jingshu Peng; Lei Chen
  5. Deep Learning for Multi-Country GDP Prediction: A Study of Model Performance and Data Impact By Huaqing Xie; Xingcheng Xu; Fangjia Yan; Xun Qian; Yanqing Yang
  6. Safety vs. Performance: How Multi-Objective Learning Reduces Barriers to Market Entry By Meena Jagadeesan; Michael I. Jordan; Jacob Steinhardt
  7. Double Machine Learning at Scale to Predict Causal Impact of Customer Actions By Sushant More; Priya Kotwal; Sujith Chappidi; Dinesh Mandalapu; Chris Khawand
  8. MANA-Net: Mitigating Aggregated Sentiment Homogenization with News Weighting for Enhanced Market Prediction By Mengyu Wang; Tiejun Ma
  9. Health Inequality and Health Types By Margherita Borella; Francisco Bullano; Mariacristina De Nardi; Benjamin Krueger; Elena Manresa
  10. Nowcasting Distributional National Accounts for the United States: A Machine Learning Approach By Gary Cornwall; Marina Gindelsky
  11. The most attractive municipalities in Bolivia: an analysis with electricity consumption data and satellite images By Guillermo Guzmán Prudencio; Lykke E. Andersen
  12. Data mining and NLP for Processing Social Offers of a National Aid Organization By Senst, Benjamin
  13. Doombot versus other machine-learning methods for evaluating recession risks in OECD countries By Thomas Chalaux; Dave Turner

  1. By: Ina Ganguli; Jeffrey Lin; Vitaly Meursault; Nicholas F. Reynolds
    Abstract: As distorted maps may mislead, Natural Language Processing (NLP) models may misrepresent. How do we know which NLP model to trust? We provide comprehensive guidance for selecting and applying NLP representations of patent text. We develop novel validation tasks to evaluate several leading NLP models. These tasks assess how well candidate models align with both expert and non-expert judgments of patent similarity. State-of-the-art language models significantly outperform traditional approaches such as TF-IDF. Using our validated representations, we measure a secular decline in contemporaneous patent similarity: inventors are “spreading out” over an expanding knowledge frontier. This finding is corroborated by declining rates of multiple invention from newly-digitized historical patent interference records. In contrast, selecting another single representation without validating alternatives yields an ambiguous or even opposing trend. Thus, our framework addresses a fundamental challenge of selecting among different black-box NLP models that produce varying economic measurements. To facilitate future research, we plan to provide our validation task data and embeddings for all US patents from 1836–2023.
    JEL: C81 L19 O31
    Date: 2024–09
    URL: https://d.repec.org/n?u=RePEc:nbr:nberwo:32934
  2. By: Prudence Djagba; Callixte Ndizihiwe
    Abstract: This study investigates the application of machine learning algorithms to pricing American options with Monte Carlo simulations. Traditional models, such as the Black-Scholes-Merton framework, often fail to adequately address the complexities of American options, which include the possibility of early exercise and non-linear payoff structures. Machine learning is applied in conjunction with Monte Carlo methods through the Least Squares Method (LSM). This research aims to improve the accuracy and efficiency of option pricing. The study evaluates several machine learning models, including neural networks and decision trees, highlighting their potential to outperform traditional approaches. The results from applying machine learning algorithms within LSM indicate that integrating machine learning with Monte Carlo simulations can enhance pricing accuracy and provide more robust predictions, offering significant insights into quantitative finance by merging classical financial theories with modern computational techniques. The dataset was split into features and the target variable representing bid prices, with an 80-20 train-validation split. LSTM and GRU models were constructed using TensorFlow's Keras API, each with four hidden layers of 200 neurons and an output layer for bid price prediction, optimized with the Adam optimizer and MSE loss function. The GRU model outperformed the LSTM model across all evaluated metrics, demonstrating lower mean absolute error, mean squared error, and root mean squared error, along with greater stability and efficiency in training.
    Date: 2024–09
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2409.03204
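The least-squares Monte Carlo (LSM) procedure that this paper builds on can be sketched in a few lines. The following is a minimal numpy implementation of Longstaff-Schwarz pricing for an American put, using a quadratic polynomial as the regression basis; the paper's contribution is to replace this regression step with machine learning models, which is not reproduced here, and the benchmark parameters below are the classic textbook case, not the paper's data.

```python
import numpy as np

def lsm_american_put(S0, K, r, sigma, T, n_steps=50, n_paths=10_000, seed=0):
    """Longstaff-Schwartz least-squares Monte Carlo for an American put.
    At each exercise date the continuation value is estimated by regressing
    discounted future cash flows on a polynomial basis of the stock price."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    disc = np.exp(-r * dt)
    # Simulate geometric Brownian motion paths under the risk-neutral measure
    z = rng.standard_normal((n_paths, n_steps))
    S = S0 * np.exp(np.cumsum((r - 0.5 * sigma**2) * dt
                              + sigma * np.sqrt(dt) * z, axis=1))
    cashflow = np.maximum(K - S[:, -1], 0.0)          # payoff at maturity
    for t in range(n_steps - 2, -1, -1):              # backward induction
        cashflow *= disc
        itm = K - S[:, t] > 0                         # regress on in-the-money paths only
        if itm.sum() < 3:
            continue
        coeffs = np.polyfit(S[itm, t], cashflow[itm], 2)
        continuation = np.polyval(coeffs, S[itm, t])
        exercise = K - S[itm, t]
        ex_now = exercise > continuation
        cashflow[np.where(itm)[0][ex_now]] = exercise[ex_now]
    return disc * cashflow.mean()

# A classic benchmark case (Longstaff & Schwartz report about 4.48 for it)
price = lsm_american_put(S0=36.0, K=40.0, r=0.06, sigma=0.2, T=1.0)
```

The ML variants evaluated in the paper swap the polynomial fit for a learned estimator of the continuation value, keeping the backward-induction structure intact.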
  3. By: Abbaszadehpeivasti, Hadi (Tilburg University, School of Economics and Management)
    Date: 2024
    URL: https://d.repec.org/n?u=RePEc:tiu:tiutis:3050a62d-1a1f-494e-99ef-7d118aa767e4
  4. By: Zhizhuo Kou; Holam Yu; Jingshu Peng; Lei Chen
    Abstract: Despite significant progress in deep learning for financial trading, existing models often face instability and high uncertainty, hindering their practical application. Leveraging advancements in Large Language Models (LLMs) and multi-agent architectures, we propose a novel framework for quantitative stock investment in portfolio management and alpha mining. Our framework addresses these issues by integrating LLMs to generate diversified alphas and employing a multi-agent approach to dynamically evaluate market conditions. In this framework, LLMs mine alpha factors from multimodal financial data, ensuring a comprehensive understanding of market dynamics. The first module extracts predictive signals by integrating numerical data, research papers, and visual charts. The second module uses ensemble learning to construct a diverse pool of trading agents with varying risk preferences, enhancing strategy performance through a broader market analysis. In the third module, a dynamic weight-gating mechanism selects and assigns weights to the most relevant agents based on real-time market conditions, enabling the creation of an adaptive and context-aware composite alpha formula. Extensive experiments on the Chinese stock markets demonstrate that this framework significantly outperforms state-of-the-art baselines across multiple financial metrics. The results underscore the efficacy of combining LLM-generated alphas with a multi-agent architecture to achieve superior trading performance and stability. This work highlights the potential of AI-driven approaches in enhancing quantitative investment strategies and sets a new benchmark for integrating advanced machine learning techniques in financial trading, one that can also be applied to diverse markets.
    Date: 2024–09
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2409.06289
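The third module's dynamic weight gating can be illustrated with a toy sketch. This is not the authors' gating network: here a hypothetical Sharpe-like score on each agent's recent returns stands in for their learned, market-conditioned scoring, and the softmax turns scores into blending weights for the agents' alpha signals.

```python
import numpy as np

def gate_weights(agent_returns, temp=1.0):
    """Score each trading agent by recent risk-adjusted performance and
    softmax the scores into blending weights (illustrative stand-in for a
    learned, market-conditioned gate)."""
    mu = agent_returns.mean(axis=1)
    sd = agent_returns.std(axis=1) + 1e-8    # avoid division by zero
    score = mu / sd                          # Sharpe-like score per agent
    w = np.exp((score - score.max()) / temp)
    return w / w.sum()

def composite_alpha(agent_alphas, weights):
    # blend the agents' per-asset alpha signals into one composite signal
    return weights @ agent_alphas
```

The temperature controls how aggressively the gate concentrates weight on the currently best-performing agents.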
  5. By: Huaqing Xie; Xingcheng Xu; Fangjia Yan; Xun Qian; Yanqing Yang
    Abstract: GDP is a vital measure of a country's economic health, reflecting the total value of goods and services produced. Forecasting GDP growth is essential for economic planning, as it helps governments, businesses, and investors anticipate trends, make informed decisions, and promote stability and growth. While most previous work focuses on predicting the GDP growth rate for a single country or with machine learning methods, in this paper we present a comprehensive study of GDP growth forecasting in the multi-country scenario using deep learning algorithms. For the prediction of GDP growth where only GDP growth values are used, linear regression is generally better than deep learning algorithms. However, for the regression and the prediction of GDP growth with selected economic indicators, deep learning algorithms can be superior to linear regression. We also investigate the influence of a novel data source -- light intensity data -- on the prediction of GDP growth, and numerical experiments indicate that it does not necessarily improve prediction performance. Code is provided at https://github.com/Sariel2018/Multi-Country-GDP-Prediction.git.
    Date: 2024–09
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2409.02551
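The linear-regression benchmark that proves hard to beat when only past GDP growth values are available is essentially an autoregression. A minimal sketch (the lag order here is a hypothetical choice, not the paper's specification):

```python
import numpy as np

def ar_baseline(series, p=4):
    """Fit an AR(p) regression by least squares and return the one-step-ahead
    forecast -- the simple linear benchmark used when only past growth values
    are available."""
    # rows of X are the p lagged values preceding each target observation
    X = np.column_stack([series[i:len(series) - p + i] for i in range(p)])
    y = series[p:]
    A = np.hstack([np.ones((len(X), 1)), X])
    b = np.linalg.lstsq(A, y, rcond=None)[0]
    return b[0] + series[-p:] @ b[1:]
```

On a deterministic series the fit is exact; on real growth data one would select the lag order by an information criterion and evaluate out of sample.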
  6. By: Meena Jagadeesan; Michael I. Jordan; Jacob Steinhardt
    Abstract: Emerging marketplaces for large language models and other large-scale machine learning (ML) models appear to exhibit market concentration, which has raised concerns about whether there are insurmountable barriers to entry in such markets. In this work, we study this issue from both an economic and an algorithmic point of view, focusing on a phenomenon that reduces barriers to entry. Specifically, an incumbent company risks reputational damage unless its model is sufficiently aligned with safety objectives, whereas a new company can more easily avoid reputational damage. To study this issue formally, we define a multi-objective high-dimensional regression framework that captures reputational damage, and we characterize the number of data points that a new company needs to enter the market. Our results demonstrate how multi-objective considerations can fundamentally reduce barriers to entry -- the required number of data points can be significantly smaller than the incumbent company's dataset size. En route to proving these results, we develop scaling laws for high-dimensional linear regression in multi-objective environments, showing that the scaling rate becomes slower when the dataset size is large, which could be of independent interest.
    Date: 2024–09
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2409.03734
  7. By: Sushant More; Priya Kotwal; Sujith Chappidi; Dinesh Mandalapu; Chris Khawand
    Abstract: The Causal Impact (CI) of customer actions is broadly used across the industry to inform both short- and long-term investment decisions of various types. In this paper, we apply the double machine learning (DML) methodology to estimate CI values across hundreds of customer actions of business interest and hundreds of millions of customers. We operationalize DML through a causal ML library based on Spark with a flexible, JSON-driven model configuration approach to estimate CI at scale (i.e., across hundreds of actions and millions of customers). We outline the DML methodology and implementation, and the associated benefits over the traditional potential-outcomes-based CI model. We show population-level as well as customer-level CI values along with confidence intervals. The validation metrics show a 2.2% gain over the baseline methods and a 2.5X gain in computational time. Our contribution is to advance the scalable application of CI, while also providing an interface that allows faster experimentation, cross-platform support, the ability to onboard new use cases, and improved accessibility of the underlying code for partner teams.
    Date: 2024–09
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2409.02332
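The partialling-out form of double machine learning can be sketched as follows. This is a numpy-only toy with two-fold cross-fitting and a quadratic ridge regression as the nuisance learner; the paper's Spark-based library, JSON-driven configuration, and choice of ML learners are not reproduced here, and the synthetic data are purely illustrative.

```python
import numpy as np

def poly_features(X):
    # quadratic feature map: intercept, linear, and squared terms
    return np.hstack([np.ones((len(X), 1)), X, X ** 2])

def ridge_fit_predict(X_tr, y_tr, X_te, lam=1e-3):
    A, B = poly_features(X_tr), poly_features(X_te)
    w = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ y_tr)
    return B @ w

def dml_effect(X, d, y, n_folds=2, seed=0):
    """Cross-fitted partialling-out DML: residualize the outcome y and the
    action d on confounders X with a flexible learner, then regress the
    outcome residual on the action residual."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    y_res, d_res = np.empty_like(y), np.empty_like(d)
    for k in range(n_folds):
        te = folds[k]
        tr = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        y_res[te] = y[te] - ridge_fit_predict(X[tr], y[tr], X[te])
        d_res[te] = d[te] - ridge_fit_predict(X[tr], d[tr], X[te])
    return (d_res @ y_res) / (d_res @ d_res)

# Synthetic check: the true effect of the action d on outcome y is 2.0,
# confounded through a nonlinear function of X
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 5))
d = X[:, 0] + rng.normal(size=2000)
y = 2.0 * d + X[:, 0] ** 2 + rng.normal(size=2000)
theta = dml_effect(X, d, y)
```

Cross-fitting is the key ingredient: each observation's residuals come from nuisance models fit on the other fold, which avoids the overfitting bias of naive plug-in estimation.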
  8. By: Mengyu Wang; Tiejun Ma
    Abstract: It is widely acknowledged that extracting market sentiments from news data benefits market predictions. However, existing methods of using financial sentiments remain simplistic, relying on equal-weight and static aggregation to manage sentiments from multiple news items. This leads to a critical issue termed “Aggregated Sentiment Homogenization”, which we have explored through our analysis of a large financial news dataset from industry practice. This phenomenon occurs when aggregating numerous sentiments, causing representations to converge towards the mean values of sentiment distributions and thereby smoothing out unique and important information. Consequently, the aggregated sentiment representations lose much of the predictive value of the news data. To address this problem, we introduce the Market Attention-weighted News Aggregation Network (MANA-Net), a novel method that leverages a dynamic market-news attention mechanism to aggregate news sentiments for market prediction. MANA-Net learns the relevance of news sentiments to price changes and assigns varying weights to individual news items. By integrating the news aggregation step into the networks for market prediction, MANA-Net allows for trainable sentiment representations that are optimized directly for prediction. We evaluate MANA-Net using the S&P 500 and NASDAQ 100 indices, along with financial news spanning from 2003 to 2018. Experimental results demonstrate that MANA-Net outperforms various recent market prediction methods, enhancing Profit & Loss by 1.1% and the daily Sharpe ratio by 0.252.
    Date: 2024–09
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2409.05698
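The core idea of attention-weighted aggregation, as opposed to the equal-weight mean that causes homogenization, can be sketched in a few lines. This is not the MANA-Net architecture: `sent` (one sentiment vector per news item), `query` (a market-state vector), and the bilinear scoring matrix `W` are hypothetical stand-ins for the paper's learned components.

```python
import numpy as np

def attention_aggregate(sent, query, W):
    """Aggregate per-news sentiment vectors with market-attention weights
    instead of an equal-weight mean, so salient items are not smoothed away."""
    scores = sent @ W @ query                 # relevance of each item to the market state
    w = np.exp(scores - scores.max())
    w /= w.sum()                              # softmax attention weights
    return w @ sent                           # weighted aggregate sentiment
```

With `W = 0` all items receive equal weight and the equal-weight mean is recovered; training `W` end-to-end against the prediction loss is what lets relevant items dominate the aggregate.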
  9. By: Margherita Borella; Francisco Bullano; Mariacristina De Nardi; Benjamin Krueger; Elena Manresa
    Abstract: While health affects many economic outcomes, its dynamics are still poorly understood. We use k-means clustering, a machine learning technique, and data from the Health and Retirement Study to identify health types during middle and old age. We identify five health types: the vigorous resilient, the fair-health resilient, the fair-health vulnerable, the frail resilient, and the frail vulnerable. They are characterized by different starting health and by different health and mortality trajectories. Our five health types account for 84% of the variation in health trajectories and are not explained by observable characteristics, such as age, marital status, education, gender, race, health-related behaviors, and health insurance status, but rather, by one’s past health dynamics. We also show that health types are important drivers of health and mortality heterogeneity and dynamics. Our results underscore the importance of better understanding health type formation and of modeling it appropriately to properly evaluate the effects of health on people’s decisions and the implications of policy reforms.
    Keywords: Mortality dynamics; Health inequality; Health dynamics; Inequality; Health types
    JEL: I10
    Date: 2024–08–23
    URL: https://d.repec.org/n?u=RePEc:fip:fedmoi:98796
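The clustering step behind the health types can be sketched with a plain k-means implementation. The paper clusters HRS health histories with k = 5; the code below is a generic toy version, and the synthetic trajectory vectors used to exercise it are illustrative, not HRS data.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means: assign each trajectory to its nearest centroid, then
    move each centroid to the mean of its members, until assignments settle."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # squared Euclidean distance of every row to every centroid
        labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if (labels == j).any() else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

Each row of `X` would be one person's vector of health measurements over time; the returned centroids are then the "type" trajectories that the paper characterizes.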
  10. By: Gary Cornwall; Marina Gindelsky
    Abstract: Inequality statistics are usually calculated from high-quality, comprehensive survey or administrative microdata. Naturally, this data is typically available with a lag of at least 9 months from the reference period. In turbulent times, there is interest in knowing the distributional impacts of observable aggregate business cycle and policy changes sooner. In this paper, we use an elastic net, a generalized model that incorporates lasso and ridge regressions as special cases, to nowcast the overall Gini coefficient and quintile-level income shares. We use national accounts data starting in 2000, published by the Bureau of Economic Analysis, as features instead of the underlying microdata to produce a series of distributional nowcasts for 2020-2022. We find that we can create advance inequality estimates approximately one month after the end of the calendar year, reducing the present lag by almost a year.
    JEL: C52 C53 D31 E01
    Date: 2024–09
    URL: https://d.repec.org/n?u=RePEc:bea:papers:0130
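The elastic net at the heart of the nowcasting exercise blends the lasso (L1) and ridge (L2) penalties; at alpha = 1 it reduces to the lasso, at alpha = 0 to ridge. A minimal coordinate-descent sketch (generic, not the authors' estimation code; the synthetic data in the check below are illustrative):

```python
import numpy as np

def soft(z, g):
    # soft-thresholding operator contributed by the lasso part of the penalty
    return np.sign(z) * np.maximum(np.abs(z) - g, 0.0)

def elastic_net(X, y, lam=0.1, alpha=0.5, n_iter=200):
    """Coordinate descent for the elastic net: alpha blends the lasso (L1)
    and ridge (L2) penalties. Assumes the columns of X are standardized."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ b + X[:, j] * b[j]           # partial residual
            rho = X[:, j] @ r_j / n
            b[j] = soft(rho, lam * alpha) / (col_sq[j] + lam * (1 - alpha))
    return b
```

The L1 part zeroes out irrelevant national-accounts features while the L2 part stabilizes the fit among correlated ones, which is why the blend suits a setting with many related aggregates and few annual observations.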
  11. By: Guillermo Guzmán Prudencio (SDSN Bolivia); Lykke E. Andersen (SDSN Bolivia)
    Abstract: This research analyzes the most attractive municipalities in Bolivia, that is, those capable of attracting (and retaining) more population, using new methodological approaches: large databases of electricity consumption (Big Data) and powerful satellite images. The central analysis identifies the most attractive municipalities in the country and describes them using some of their essential characteristics (population, geographical position, political status, and transport infrastructure). The main hypothesis of the research is that a greater concurrence of these essential characteristics is positively related to the degree of attractiveness of the municipalities.
    Keywords: Bolivia, Migration, Attractive Municipalities, Big Data
    JEL: O15 O18
    Date: 2023–10
    URL: https://d.repec.org/n?u=RePEc:iad:sdsnwp:0623
  12. By: Senst, Benjamin
    Abstract: For large organisations with numerous organisational units, it can be challenging to keep track of individual events. In a joint project by Data Science for Social Good Berlin e.V. and the Data Science Hub of the German Red Cross, social service offers were processed in several phases between summer 2022 and summer 2024 using new technologies such as web scraping, data engineering, and natural language processing, and their implementation in various user applications was tested. More than 600,000 web documents were collected and more than 30,000 offers were identified. The results of this automated method were compared with the existing data set. Web scraping and subsequent processing are suitable for at least supplementing the previous approach. Web scraping, NLP, and data engineering offer large organisations the opportunity to effectively gain an overview of local events.
    Date: 2024–09–06
    URL: https://d.repec.org/n?u=RePEc:osf:socarx:3pd4s
  13. By: Thomas Chalaux; Dave Turner
    Abstract: An extensive literature explains recession risks using a variety of financial and business cycle variables. The problem of selecting a parsimonious set of explanatory variables, which can differ between countries and prediction horizons, is naturally suited to machine-learning methods. The current paper compares models selected by conventional machine-learning methods with a customised algorithm, ‘Doombot’, which uses ‘brute force’ to test combinations of variables and imposes restrictions so that predictions are consistent with a coherent economic narrative. The same algorithms are applied to 20 OECD countries with an emphasis on out-of-sample testing using a rolling origin, including a window for the Global Financial Crisis. Despite the imposition of additional restrictions, Doombot is found to be the best-performing algorithm. Further testing confirms that the imposition of judgmental constraints tends to improve rather than hinder out-of-sample performance. Moreover, these constraints provide a more coherent economic narrative and so mitigate the common ‘black box’ criticism of machine-learning methods.
    Keywords: forecast, GDP growth, LASSO, machine-learning methods, OCMT, recession, risk
    JEL: E01 E17 E65 E58
    Date: 2024–09–20
    URL: https://d.repec.org/n?u=RePEc:oec:ecoaaa:1821-en
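The brute-force-with-restrictions idea can be illustrated with a toy selector: fit every small combination of candidate predictors, discard any fit whose coefficient signs contradict an economic prior, and keep the combination with the best out-of-sample error. This is only a sketch of the principle, not the Doombot algorithm itself: it uses a single train/test split rather than the paper's rolling origin, a linear model rather than a recession-probability model, and hypothetical sign priors.

```python
import itertools
import numpy as np

def ols(X, y):
    """OLS with intercept; returns [b0, b1, ...]."""
    A = np.hstack([np.ones((len(X), 1)), X])
    return np.linalg.lstsq(A, y, rcond=None)[0]

def brute_force_select(X, y, sign_prior, split, max_vars=2):
    """Try every combination of up to max_vars predictors, reject fits whose
    coefficient signs contradict the economic prior, and keep the combination
    with the lowest out-of-sample RMSE."""
    best, best_rmse = None, np.inf
    for k in range(1, max_vars + 1):
        for combo in itertools.combinations(range(X.shape[1]), k):
            cols = list(combo)
            b = ols(X[:split, cols], y[:split])
            if any(sign_prior[j] * b[i + 1] < 0 for i, j in enumerate(cols)):
                continue  # economically incoherent sign: reject this model
            A_te = np.hstack([np.ones((len(X) - split, 1)), X[split:, cols]])
            rmse = np.sqrt(((A_te @ b - y[split:]) ** 2).mean())
            if rmse < best_rmse:
                best, best_rmse = combo, rmse
    return best, best_rmse
```

The sign restriction is what ties each retained model to a coherent economic story: a candidate that fits well for the "wrong" reason is discarded before the horse race.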

This nep-big issue is ©2024 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at https://nep.repec.org. For comments please write to the director of NEP, Marco Novarese at <director@nep.repec.org>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.