nep-big 2024-05-06 papers

on Big Data

Issue of 2024‒05‒06
twenty-six papers chosen by
Tom Coupé, University of Canterbury

Detection of Temporality at Discourse Level on Financial News by Combining Natural Language Processing and Machine Learning By Silvia García-Méndez; Francisco de Arriba-Pérez; Ana Barros-Vila; Francisco J. González-Castaño
Bankruptcy prediction using machine learning and Shapley additive explanations By Hoang Hiep Nguyen; Jean-Laurent Viviani; Sami Ben Jabeur
BERTopic-Driven Stock Market Predictions: Unraveling Sentiment Insights By Enmin Zhu
Automatic detection of relevant information, predictions and forecasts in financial news through topic modelling with Latent Dirichlet Allocation By Silvia García-Méndez; Francisco de Arriba-Pérez; Ana Barros-Vila; Francisco J. González-Castaño; Enrique Costa-Montenegro
Long Short-Term Memory Pattern Recognition in Currency Trading By Jai Pal
Empowering Credit Scoring Systems with Quantum-Enhanced Machine Learning By Javier Mancilla; André Sequeira; Iraitz Montalbán; Tomas Tagliani; Francisco Llaneza; Claudio Beiza
Do Socially Responsible Firms Disclosure to Signal? By Mari Sakudo
Improved model-free bounds for multi-asset options using option-implied information and deep learning By Evangelia Dragazi; Shuaiqiang Liu; Antonis Papapantoleon
Using Machine Learning to Forecast Market Direction with Efficient Frontier Coefficients By Nolan Alexander; William Scherer
Chain-structured neural architecture search for financial time series forecasting By Denis Levchenko; Efstratios Rappos; Shabnam Ataee; Biagio Nigro; Stephan Robert
Temporal Graph Networks for Graph Anomaly Detection in Financial Networks By Yejin Kim; Youngbin Lee; Minyoung Choe; Sungju Oh; Yongjae Lee
Predicting Full Retirement Attainment of NBA Players By Giorgos Foutzopoulos; Nikolaos Pandis; Michail Tsagris
What Hundreds of Economic News Events Say About Belief Overreaction in the Stock Market By Francesco Bianchi; Sydney C. Ludvigson; Sai Ma
Regional Economic Sentiment: Constructing Quantitative Estimates from the Beige Book and Testing Their Ability to Forecast Recessions By Ilias Filippou; Christian Garciga; James Mitchell; My T. Nguyen
Supervised Autoencoder MLP for Financial Time Series Forecasting By Bartosz Bieganowski; Robert Slepaczuk
Postprocessing of point predictions for probabilistic forecasting of electricity prices: Diversity matters By Arkadiusz Lipiecki; Bartosz Uniejewski; Rafał Weron
Social Media Emotions and Market Behavior By Domonkos F. Vamossy
Regulatory compliance with limited enforceability: Evidence from privacy policies By Ganglmair, Bernhard; Krämer, Julia; Gambato, Jacopo
Missing Data Imputation With Granular Semantics and AI-driven Pipeline for Bankruptcy Prediction By Debarati Chakraborty; Ravi Ranjan
Enhancing environmental management through big data: spatial analysis of urban ecological governance and big data development By Lei, Yunliang
Mortality Burden From Wildfire Smoke Under Climate Change By Minghao Qiu; Jessica Li; Carlos F. Gould; Renzhi Jing; Makoto Kelp; Marissa Childs; Mathew Kiang; Sam Heft-Neal; Noah Diffenbaugh; Marshall Burke
Adoption and diffusion of blockchain technology By Gschnaidtner, Christoph; Dehghan, Robert; Hottenrott, Hanna; Schwierzy, Julian
Enhancing Educational Outcome with Machine Learning: Modeling Friendship Formation, Measuring Peer Effect and Optimizing Class Assignment By Lei Bill Wang; Om Prakash Bedant; Haoran Wang; Zhenbang Jiao; Jia Yin
Construction of a Japanese Financial Benchmark for Large Language Models By Masanori Hirano
Forecasting with Neuro-Dynamic Programming By Pedro Afonso Fernandes
The anatomy of Chinese innovation: Insights on patent quality and ownership By Boeing, Philipp; Brandt, Loren; Dai, Ruochen; Lim, Kevin; Peters, Bettina

Detection of Temporality at Discourse Level on Financial News by Combining Natural Language Processing and Machine Learning

By:	Silvia García-Méndez; Francisco de Arriba-Pérez; Ana Barros-Vila; Francisco J. González-Castaño
Abstract:	Finance-related news such as Bloomberg News, CNN Business and Forbes are valuable sources of real data for market screening systems. In news, an expert shares opinions beyond plain technical analyses that include context such as political, sociological and cultural factors. In the same text, the expert often discusses the performance of different assets. Some key statements are mere descriptions of past events while others are predictions. Therefore, understanding the temporality of the key statements in a text is essential to separate context information from valuable predictions. We propose a novel system to detect the temporality of finance-related news at discourse level that combines Natural Language Processing and Machine Learning techniques, and exploits sophisticated features such as syntactic and semantic dependencies. More specifically, we seek to extract the dominant tenses of the main statements, which may be either explicit or implicit. We have tested our system on a labelled dataset of finance-related news annotated by researchers with knowledge in the field. Experimental results reveal a high detection precision compared to an alternative rule-based baseline approach. Ultimately, this research contributes to the state-of-the-art of market screening by identifying predictive knowledge for financial decision making.
Date:	2024–03
URL:	http://d.repec.org/n?u=RePEc:arx:papers:2404.01337&r=big

Bankruptcy prediction using machine learning and Shapley additive explanations

By:	Hoang Hiep Nguyen (CREM - Centre de recherche en économie et management - UNICAEN - Université de Caen Normandie - NU - Normandie Université - UR - Université de Rennes - CNRS - Centre National de la Recherche Scientifique); Jean-Laurent Viviani (CREM - Centre de recherche en économie et management - UNICAEN - Université de Caen Normandie - NU - Normandie Université - UR - Université de Rennes - CNRS - Centre National de la Recherche Scientifique); Sami Ben Jabeur (ESDES - ESDES, Lyon Business School - UCLy - UCLy - UCLy (Lyon Catholic University), UCLy - UCLy (Lyon Catholic University))
Abstract:	Recently, ensemble-based machine learning models have been widely used and have demonstrated their efficiency in bankruptcy prediction. However, these algorithms are black box models and people cannot understand why they make their forecasts. This explains why interpretability methods in machine learning attract attention from many artificial intelligence researchers. In this paper, we evaluate the prediction performance of Random Forest, LightGBM, XGBoost, and NGBoost (Natural Gradient Boosting for probabilistic prediction) for French firms from different industries with a horizon of 1-5 years. We then use Shapley Additive Explanations (SHAP), a model-agnostic method, to explain XGBoost, one of the best models for our data. SHAP can show how each feature impacts the output from XGBoost. Furthermore, individual predictions can also be explained, thus allowing black box models to be used in credit risk management.
Keywords:	Shapley additive explanations, Explainable machine learning, Bankruptcy prediction, Ensemble-based model, XGBoost
Date:	2023
URL:	http://d.repec.org/n?u=RePEc:hal:journl:hal-04223161&r=big

BERTopic-Driven Stock Market Predictions: Unraveling Sentiment Insights

By:	Enmin Zhu
Abstract:	This paper explores the intersection of Natural Language Processing (NLP) and financial analysis, focusing on the impact of sentiment analysis in stock price prediction. We employ BERTopic, an advanced NLP technique, to analyze the sentiment of topics derived from stock market comments. Our methodology integrates this sentiment analysis with various deep learning models, renowned for their effectiveness in time series and stock prediction tasks. Through comprehensive experiments, we demonstrate that incorporating topic sentiment notably enhances the performance of these models. The results indicate that topics in stock market comments provide implicit, valuable insights into stock market volatility and price trends. This study contributes to the field by showcasing the potential of NLP in enriching financial analysis and opens up avenues for further research into real-time sentiment analysis and the exploration of emotional and contextual aspects of market sentiment. The integration of advanced NLP techniques like BERTopic with traditional financial analysis methods marks a step forward in developing more sophisticated tools for understanding and predicting market behaviors.
Date:	2024–04
URL:	http://d.repec.org/n?u=RePEc:arx:papers:2404.02053&r=big

Automatic detection of relevant information, predictions and forecasts in financial news through topic modelling with Latent Dirichlet Allocation

By:	Silvia García-Méndez; Francisco de Arriba-Pérez; Ana Barros-Vila; Francisco J. González-Castaño; Enrique Costa-Montenegro
Abstract:	Financial news items are unstructured sources of information that can be mined to extract knowledge for market screening applications. Manual extraction of relevant information from the continuous stream of finance-related news is cumbersome and beyond the skills of many investors, who, at most, can follow a few sources and authors. Accordingly, we focus on the analysis of financial news to identify relevant text and, within that text, forecasts and predictions. We propose a novel Natural Language Processing (NLP) system to assist investors in the detection of relevant financial events in unstructured textual sources by considering both relevance and temporality at the discursive level. Firstly, we segment the text to group together closely related text. Secondly, we apply co-reference resolution to discover internal dependencies within segments. Finally, we perform relevant topic modelling with Latent Dirichlet Allocation (LDA) to separate relevant from less relevant text and then analyse the relevant text using a Machine Learning-oriented temporal approach to identify predictions and speculative statements. We created an experimental data set composed of 2,158 financial news items that were manually labelled by NLP researchers to evaluate our solution. The ROUGE-L values for the identification of relevant text and predictions/forecasts were 0.662 and 0.982, respectively. To our knowledge, this is the first work to jointly consider relevance and temporality at the discursive level. It contributes to the transfer of human associative discourse capabilities to expert systems through the combination of multi-paragraph topic segmentation and co-reference resolution to separate author expression patterns, topic modelling with LDA to detect relevant text, and discursive temporality analysis to identify forecasts and predictions within this text.
Date:	2024–03
URL:	http://d.repec.org/n?u=RePEc:arx:papers:2404.01338&r=big

Long Short-Term Memory Pattern Recognition in Currency Trading

By:	Jai Pal
Abstract:	This study delves into the analysis of financial markets through the lens of Wyckoff Phases, a framework devised by Richard D. Wyckoff in the early 20th century. Focusing on the accumulation pattern within the Wyckoff framework, the research explores the phases of trading range and secondary test, elucidating their significance in understanding market dynamics and identifying potential trading opportunities. By dissecting the intricacies of these phases, the study sheds light on the creation of liquidity through market structure, offering insights into how traders can leverage this knowledge to anticipate price movements and make informed decisions. The effective detection and analysis of Wyckoff patterns necessitate robust computational models capable of processing complex market data, with spatial data best analyzed using Convolutional Neural Networks (CNNs) and temporal data through Long Short-Term Memory (LSTM) models. The creation of training data involves the generation of swing points, representing significant market movements, and filler points, introducing noise and enhancing model generalization. Activation functions, such as the sigmoid function, play a crucial role in determining the output behavior of neural network models. The results of the study demonstrate the remarkable efficacy of deep learning models in detecting Wyckoff patterns within financial data, underscoring their potential for enhancing pattern recognition and analysis in financial markets. In conclusion, the study highlights the transformative potential of AI-driven approaches in financial analysis and trading strategies, with the integration of AI technologies shaping the future of trading and investment practices.
Date:	2024–02
URL:	http://d.repec.org/n?u=RePEc:arx:papers:2403.18839&r=big

Empowering Credit Scoring Systems with Quantum-Enhanced Machine Learning

By:	Javier Mancilla; André Sequeira; Iraitz Montalbán; Tomas Tagliani; Francisco Llaneza; Claudio Beiza
Abstract:	Quantum Kernels are projected to provide early-stage usefulness for quantum machine learning. However, highly sophisticated classical models are hard to surpass without losing interpretability, particularly when vast datasets can be exploited. Nonetheless, classical models struggle once data is scarce and skewed. Quantum feature spaces are projected to find better links between data features and the target class to be predicted, even in such challenging scenarios, and, most importantly, to offer enhanced generalization capabilities. In this work, we propose a novel approach called Systemic Quantum Score (SQS) and provide preliminary results indicating a potential advantage over purely classical models in a production-grade use case for the Finance sector. In our specific study, SQS shows an increased capacity to extract patterns out of fewer data points as well as improved performance over data-hungry algorithms such as XGBoost, providing an advantage in a competitive market such as the FinTech and Neobank regime.
Date:	2024–03
URL:	http://d.repec.org/n?u=RePEc:arx:papers:2404.00015&r=big

Do Socially Responsible Firms Disclosure to Signal?

By:	Mari Sakudo
Abstract:	An increasing number of investors incorporate companies' CSR information into their financial decisions. This study empirically examines the signaling theory in the context of CSR disclosures using rich information on firms' CSR activities and climate-related costs of large Japanese firms by a machine learning method. According to the results, Japanese firms disclose their sustainability information to signal their superior performance rather than greenwashing. While many investors and policy makers focus more on climate risks following the COVID-19 pandemic, this empirical evidence remains the same before and after the crisis.
Date:	2024–04
URL:	http://d.repec.org/n?u=RePEc:tcr:wpaper:e204&r=big

Improved model-free bounds for multi-asset options using option-implied information and deep learning

By:	Evangelia Dragazi; Shuaiqiang Liu; Antonis Papapantoleon
Abstract:	We consider the computation of model-free bounds for multi-asset options in a setting that combines dependence uncertainty with additional information on the dependence structure. More specifically, we consider the setting where the marginal distributions are known and partial information, in the form of known prices for multi-asset options, is also available in the market. We provide a fundamental theorem of asset pricing in this setting, as well as a superhedging duality that allows us to transform the maximization problem over probability measures into a more tractable minimization problem over trading strategies. The latter is solved using a penalization approach combined with a deep learning approximation using artificial neural networks. The numerical method is fast and the computational time scales linearly with respect to the number of traded assets. We finally examine the significance of various pieces of additional information. Empirical evidence suggests that "relevant" information, i.e. prices of derivatives with the same payoff structure as the target payoff, is more useful than other information, and should be prioritized in view of the trade-off between accuracy and computational efficiency.
Date:	2024–04
URL:	http://d.repec.org/n?u=RePEc:arx:papers:2404.02343&r=big

Using Machine Learning to Forecast Market Direction with Efficient Frontier Coefficients

By:	Nolan Alexander; William Scherer
Abstract:	We propose a novel method to improve estimation of asset returns for portfolio optimization. This approach first performs a monthly directional market forecast using an online decision tree. The decision tree is trained on a novel set of features engineered from portfolio theory: the efficient frontier functional coefficients. Efficient frontiers can be decomposed into their functional form, a square-root second-order polynomial, and the coefficients of this function capture the information of all the constituents that compose the market in the current time period. To make these forecasts actionable, these directional forecasts are integrated into a portfolio optimization framework using expected returns conditional on the market forecast as an estimate for the return vector. This conditional expectation is calculated using the inverse Mills ratio, and the Capital Asset Pricing Model is used to translate the market forecast to individual asset forecasts. This novel method outperforms baseline portfolios, as well as other feature sets including technical indicators and the Fama-French factors. To empirically validate the proposed model, we employ a set of market sector ETFs.
Date:	2024–03
URL:	http://d.repec.org/n?u=RePEc:arx:papers:2404.00825&r=big

Chain-structured neural architecture search for financial time series forecasting

By:	Denis Levchenko; Efstratios Rappos; Shabnam Ataee; Biagio Nigro; Stephan Robert
Abstract:	We compare three popular neural architecture search strategies on chain-structured search spaces: Bayesian optimization, the hyperband method, and reinforcement learning in the context of financial time series forecasting.
Date:	2024–03
URL:	http://d.repec.org/n?u=RePEc:arx:papers:2403.14695&r=big

Temporal Graph Networks for Graph Anomaly Detection in Financial Networks

By:	Yejin Kim; Youngbin Lee; Minyoung Choe; Sungju Oh; Yongjae Lee
Abstract:	This paper explores the utilization of Temporal Graph Networks (TGN) for financial anomaly detection, a pressing need in the era of fintech and digitized financial transactions. We present a comprehensive framework that leverages TGN, capable of capturing dynamic changes in edges within financial networks, for fraud detection. Our study compares TGN's performance against static Graph Neural Network (GNN) baselines, as well as cutting-edge hypergraph neural network baselines using DGraph dataset for a realistic financial context. Our results demonstrate that TGN significantly outperforms other models in terms of AUC metrics. This superior performance underlines TGN's potential as an effective tool for detecting financial fraud, showcasing its ability to adapt to the dynamic and complex nature of modern financial systems. We also experimented with various graph embedding modules within the TGN framework and compared the effectiveness of each module. In conclusion, we demonstrated that, even with variations within TGN, it is possible to achieve good performance in the anomaly detection task.
Date:	2024–03
URL:	http://d.repec.org/n?u=RePEc:arx:papers:2404.00060&r=big

Predicting Full Retirement Attainment of NBA Players

By:	Giorgos Foutzopoulos; Nikolaos Pandis; Michail Tsagris
Abstract:	The aim of this analysis is to predict whether an NBA player will be active in the league for at least 10 years so as to be qualified for NBA's full retirement scheme, which allows for the maximum benefit payable by law. We collected per game statistics for players during their second year, drafted during the years 1999 up to 2006, for which information on their career longevity is known. By feeding these statistics of the sophomore players into statistical and machine learning algorithms we select the important statistics and manage to accomplish a satisfactory predictability performance. Further, we visualize the effect of each of the selected statistics on the estimated probability of staying in the league for more than 10 years.
Keywords:	NBA, career duration, exit discrimination
JEL:	C41 C10 L83
Date:	2024–04–20
URL:	http://d.repec.org/n?u=RePEc:crt:wpaper:2403&r=big

What Hundreds of Economic News Events Say About Belief Overreaction in the Stock Market

By:	Francesco Bianchi; Sydney C. Ludvigson; Sai Ma
Abstract:	We measure the nature and severity of a variety of belief distortions in market reactions to hundreds of economic news events using a new methodology that synthesizes estimation of a structural asset pricing model with algorithmic machine learning to quantify bias. We estimate that investors systematically overreact to perceptions about multiple fundamental shocks in a macro-dynamic system, generating asymmetric compositional effects when several counteracting shocks occur simultaneously in real-world events. We show that belief overreaction to all shocks can lead the market to over- or underreact to events, amplifying or dampening volatility.
JEL:	G1 G12 G4 G41
Date:	2024–04
URL:	http://d.repec.org/n?u=RePEc:nbr:nberwo:32301&r=big

Regional Economic Sentiment: Constructing Quantitative Estimates from the Beige Book and Testing Their Ability to Forecast Recessions

By:	Ilias Filippou; Christian Garciga; James Mitchell; My T. Nguyen
Abstract:	We use natural language processing methods to quantify the sentiment expressed in the Federal Reserve's anecdotal summaries of current economic conditions in the national and 12 Federal Reserve District-level economies as published eight times per year in the Beige Book since 1970. We document that both national and District-level economic sentiment tend to rise and fall with the US business cycle. But economic sentiment is extremely heterogeneous across Districts, and we find that national economic sentiment is not always the simple aggregation of District-level sentiment. We show that the heterogeneity in District-level economic sentiment can be used, over and above the information contained in national economic sentiment, to better forecast US recessions.
Keywords:	recessions; natural language processing; sentiment; Beige book; regional economies
Date:	2024–04–16
URL:	http://d.repec.org/n?u=RePEc:fip:fedcwq:98080&r=big

Supervised Autoencoder MLP for Financial Time Series Forecasting

By:	Bartosz Bieganowski; Robert Slepaczuk
Abstract:	This paper investigates the enhancement of financial time series forecasting with the use of neural networks through supervised autoencoders, aiming to improve investment strategy performance. It specifically examines the impact of noise augmentation and triple barrier labeling on risk-adjusted returns, using the Sharpe and Information Ratios. The study focuses on the S&P 500 index, EUR/USD, and BTC/USD as the traded assets from January 1, 2010, to April 30, 2022. Findings indicate that supervised autoencoders, with balanced noise augmentation and bottleneck size, significantly boost strategy effectiveness. However, excessive noise and large bottleneck sizes can impair performance, highlighting the importance of precise parameter tuning. This paper also presents a derivation of a novel optimization metric that can be used with triple barrier labeling. The results of this study have substantial policy implications, suggesting that financial institutions and regulators could leverage techniques presented to enhance market stability and investor protection, while also encouraging more informed and strategic investment approaches in various financial sectors.
Date:	2024–04
URL:	http://d.repec.org/n?u=RePEc:arx:papers:2404.01866&r=big

Postprocessing of point predictions for probabilistic forecasting of electricity prices: Diversity matters

By:	Arkadiusz Lipiecki; Bartosz Uniejewski; Rafał Weron
Abstract:	Operational decisions relying on predictive distributions of electricity prices can result in significantly higher profits compared to those based solely on point forecasts. However, the majority of models developed in both academic and industrial settings provide only point predictions. To address this, we examine three postprocessing methods for converting point forecasts into probabilistic ones: Quantile Regression Averaging, Conformal Prediction, and the recently introduced Isotonic Distributional Regression. We find that while IDR demonstrates the most varied performance, combining its predictive distributions with those of the other two methods results in an improvement of ca. 7.5% compared to a benchmark model with normally distributed errors, over a 4.5-year test period in the German power market spanning the COVID pandemic and the war in Ukraine. Remarkably, the performance of this combination is at par with state-of-the-art Distributional Deep Neural Networks.
Date:	2024–04
URL:	http://d.repec.org/n?u=RePEc:arx:papers:2404.02270&r=big

Social Media Emotions and Market Behavior

By:	Domonkos F. Vamossy
Abstract:	I explore the relationship between investor emotions expressed on social media and asset prices. The field has seen a proliferation of models aimed at extracting firm-level sentiment from social media data, though the behavior of these models often remains uncertain. Against this backdrop, my study employs EmTract, an open-source emotion model, to test whether the emotional responses identified on social media platforms align with expectations derived from controlled laboratory settings. This step is crucial in validating the reliability of digital platforms in reflecting genuine investor sentiment. My findings reveal that firm-specific investor emotions behave similarly to lab experiments and can forecast daily asset price movements. These impacts are larger when liquidity is lower or short interest is higher. My findings on the persistent influence of sadness on subsequent returns, along with the insignificance of the one-dimensional valence metric, underscore the importance of dissecting emotional states. This approach allows for a deeper and more accurate understanding of the intricate ways in which investor sentiments drive market movements.
Date:	2024–04
URL:	http://d.repec.org/n?u=RePEc:arx:papers:2404.03792&r=big

Regulatory compliance with limited enforceability: Evidence from privacy policies

By:	Ganglmair, Bernhard; Krämer, Julia; Gambato, Jacopo
Abstract:	The EU General Data Protection Regulation (GDPR) of 2018 introduced stringent transparency rules compelling firms to disclose, in accessible language, details of their data collection, processing, and use. The specifics of the disclosure requirement are objective, and its compliance is easily verifiable; readability, however, is subjective and difficult to enforce. We use a simple inspection model to show how this asymmetric enforceability of regulatory rules and the corresponding firm compliance are linked. We then examine this link empirically using a large sample of privacy policies from German firms. We use text-as-data techniques to construct measures of disclosure and readability and show that firms increased the disclosure volume, but the readability of their privacy policies did not improve. Larger firms in concentrated industries demonstrated a stronger response in readability compliance, potentially due to heightened regulatory scrutiny. Moreover, data protection authorities with larger budgets induce better readability compliance without effects on disclosure.
Keywords:	data protection, disclosure, GDPR, privacy policies, readability, regulation, text-as-data, topic models
JEL:	C81 D23 K12 K20 L51 M15
Date:	2024
URL:	http://d.repec.org/n?u=RePEc:zbw:zewdip:289447&r=big

Missing Data Imputation With Granular Semantics and AI-driven Pipeline for Bankruptcy Prediction

By:	Debarati Chakraborty; Ravi Ranjan
Abstract:	This work focuses on designing a pipeline for the prediction of bankruptcy. The presence of missing values, high-dimensional data, and highly class-imbalanced databases are the major challenges in the said task. A new method for missing data imputation with granular semantics has been introduced here. The merits of granular computing have been explored here to define this method. The missing values have been predicted using the feature semantics and reliable observations in a low-dimensional space, in the granular space. The granules are formed around every missing entry, considering a few of the highly correlated features and most reliable closest observations to preserve the relevance and reliability, the context, of the database against the missing entries. An intergranular prediction is then carried out for the imputation within those contextual granules. That is, the contextual granules enable a small relevant fraction of the huge database to be used for imputation and overcome the need to access the entire database repetitively for each missing value. This method is then implemented and tested for the prediction of bankruptcy with the Polish Bankruptcy dataset. It provides an efficient solution for big and high-dimensional datasets even with large imputation rates. Then an AI-driven pipeline for bankruptcy prediction has been designed using the proposed granular semantic-based data filling method, followed by solutions to issues such as high dimensionality and high class imbalance in the dataset. The rest of the pipeline consists of feature selection with the random forest for reducing dimensionality, data balancing with SMOTE, and prediction with six different popular classifiers including deep NN. All methods defined here have been experimentally verified with suitable comparative studies and proven to be effective on all the data sets captured over the five years.
Date:	2024–03
URL:	http://d.repec.org/n?u=RePEc:arx:papers:2404.00013&r=big

Enhancing environmental management through big data: spatial analysis of urban ecological governance and big data development

By:	Lei, Yunliang
Abstract:	Introduction: This research focuses on exploring the impact of Big Data Development (BDD) on Urban Ecological Governance Performance (EGP), with a particular emphasis on environmental dimensions within and among various regions. It aims to understand the complex interplay between technological advancements, urbanization, and environmental management in the context of urban ecological governance. Methods: Employing the Spatial Durbin Model (SDM), the study rigorously investigates the effects of BDD on EGP. It also examines the mediating role of Industrial Structure Level (ISL) and the moderating effects of both Level of Technological Investment (LTI) and Urbanization Level (URB), to provide a comprehensive analysis of the factors influencing urban ecological governance. Results: The findings reveal that big data significantly strengthens urban ecological governance, characterized by pronounced spatial spillover effects, indicating interregional interdependence in environmental management. Urbanization level notably amplifies the influence of BDD on EGP, whereas the magnitude of technological investments does not show a similar effect. Moreover, the industrial structure acts as a partial mediator in the relationship between BDD and EGP, with this mediating role demonstrating variability across different regions. Discussion: The research highlights the critical role of big data in enhancing urban ecological governance, particularly in terms of environmental aspects. It underscores the importance of technological advancements and urbanization in augmenting the effectiveness of ecological governance. The variability of the mediating role of industrial structure across regions suggests the need for tailored strategies in implementing big data initiatives for environmental management.
Keywords:	big data; ecological governance performance; environmental management; spatial analysis; spatial durbin model
JEL:	R14 J01
Date:	2024–03–12
URL:	http://d.repec.org/n?u=RePEc:ehl:lserod:122571&r=big
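
For readers unfamiliar with the Spatial Durbin Model named in the abstract above, its textbook form is sketched below. This is the standard specification, not necessarily the exact equation the paper estimates; mapping $y$ to EGP and $X$ to BDD plus controls is an assumption made here for illustration.

```latex
% Standard Spatial Durbin Model (illustrative; symbols are assumptions):
%   y   : n x 1 vector of outcomes (here taken to be EGP by region)
%   W   : n x n spatial weight matrix
%   X   : n x k matrix of regressors (here BDD and controls)
%   rho : spatial autoregressive coefficient
\begin{equation}
  y = \rho W y + X\beta + W X\theta + \varepsilon
\end{equation}
```

In this form, a nonzero $\rho$ or $\theta$ is what allows conditions in neighbouring regions to affect a region's own outcome, which is the kind of spatial spillover the abstract reports.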

Mortality Burden From Wildfire Smoke Under Climate Change

By:	Minghao Qiu; Jessica Li; Carlos F. Gould; Renzhi Jing; Makoto Kelp; Marissa Childs; Mathew Kiang; Sam Heft-Neal; Noah Diffenbaugh; Marshall Burke
Abstract:	Wildfire activity has increased in the US and is projected to accelerate under future climate change. However, our understanding of the impacts of climate change on wildfire smoke and health remains highly uncertain. We quantify the past and future mortality burden in the US due to wildfire smoke fine particulate matter (PM2.5). We construct an ensemble of statistical and machine learning models that link variation in climate to wildfire smoke PM2.5, and empirically estimate smoke PM2.5-mortality relationships using georeferenced data on all recorded deaths in the US from 2006 to 2019. We project that climate-driven increases in future smoke PM2.5 could result in 27,800 excess deaths per year by 2050 under a high warming scenario, a 76% increase relative to estimated 2011-2020 averages. Cumulative excess deaths from wildfire smoke PM2.5 could exceed 700,000 between 2025-2055. When monetized, climate-induced smoke deaths result in annual damages of $244 billion by mid-century, comparable to the estimated sum of all other damages in the US in prior analyses. Our research suggests that the health cost of climate-driven wildfire smoke could be among the most important and costly consequences of a warming climate in the US.
JEL:	Q51 Q53 Q54
Date:	2024–04
URL:	http://d.repec.org/n?u=RePEc:nbr:nberwo:32307&r=big

Adoption and diffusion of blockchain technology

By:	Gschnaidtner, Christoph; Dehghan, Robert; Hottenrott, Hanna; Schwierzy, Julian
Abstract:	A widespread approach to measuring the innovative capacity of companies, sectors, and regions is the analysis of patents and trademarks or the use of surveys. In emerging digital technologies this approach may, however, not be sufficient for mapping technology diffusion. This applies to blockchain technology, which is, in essence, a decentralized and distributed database (management system) that is increasingly used well beyond its originally intended purpose as the underlying infrastructure for a peer-to-peer payment system. In this article, we use an alternative method based on web-analysis and deep learning techniques that allows us to identify companies that use blockchain technology and so to determine its diffusion. Our analysis shows that blockchain is still a niche technology with only 0.88% of the analyzed firms using it. At the same time, certain sectors, namely ICT, banking & finance, and (management) consulting, show higher adoption rates ranging from 3.50% to 4.50%. Most blockchain companies are located at or close to one of the financial centers. Young firms whose business model is (partly) based on blockchain technology also locate themselves close to these centers. Thus, despite blockchain technology often being explicitly characterized as decentralized and distributed in nature, these adoption and strategic location decisions lead to "blockchain clusters".
Keywords:	technology adoption, blockchain technology, geographical distribution of firms, natural language programming
JEL:	C45 O33 R30
Date:	2024
URL:	http://d.repec.org/n?u=RePEc:zbw:zewdip:289452&r=big

Enhancing Educational Outcome with Machine Learning: Modeling Friendship Formation, Measuring Peer Effect and Optimizing Class Assignment

By:	Lei Bill Wang; Om Prakash Bedant; Haoran Wang; Zhenbang Jiao; Jia Yin
Abstract:	In this paper, we look at a school principal's class assignment problem. We break the problem into three stages: (1) friendship prediction, (2) peer effect estimation, and (3) class assignment optimization. We build a micro-founded model for friendship formation and approximate the model as a neural network. Leveraging the predicted friendship-probability adjacency matrix, we improve the traditional linear-in-means model and estimate the peer effect. We propose a new instrument to address the endogeneity of friendship selection. The estimated peer effect is slightly larger than the linear-in-means model estimate. Using the friendship prediction and peer effect estimation results, we simulate counterfactual peer effects for all students. We find that dividing students into gendered classrooms increases the average peer effect by 0.02 point on a scale of 5. We also find that an extreme-mixing class assignment method improves bottom-quartile students' peer effect by 0.08 point.
Date:	2024–04
URL:	http://d.repec.org/n?u=RePEc:arx:papers:2404.02497&r=big
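
As a rough illustration of the second stage described in the abstract above, the sketch below computes a probability-weighted peer mean from a predicted friendship matrix, which is the regressor a linear-in-means specification would use. It is a minimal sketch under stated assumptions, not the authors' estimator: the function name `peer_means`, the row normalisation, and the convention of returning 0.0 for a student with no predicted friends are all hypothetical choices made here.

```python
def peer_means(friend_prob, outcomes):
    """Probability-weighted average outcome of each student's peers.

    friend_prob[i][j] is an assumed predicted probability that student j
    is a friend of student i; outcomes[j] is student j's outcome.
    A student's own entry is ignored.
    """
    means = []
    n = len(outcomes)
    for i, row in enumerate(friend_prob):
        # Collect weights and outcomes of everyone except student i.
        weights = [row[j] for j in range(n) if j != i]
        values = [outcomes[j] for j in range(n) if j != i]
        total = sum(weights)
        if total > 0:
            # Row-normalise so the weights sum to one.
            means.append(sum(w * v for w, v in zip(weights, values)) / total)
        else:
            # Hypothetical convention: no predicted friends means no exposure.
            means.append(0.0)
    return means


# Toy example with three students (made-up numbers).
probs = [
    [0.0, 0.5, 0.5],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]
scores = [2.0, 4.0, 6.0]
exposure = peer_means(probs, scores)
```

Regressing each student's outcome on this exposure variable would give a naive peer-effect estimate; the abstract notes that the authors additionally propose an instrument for friendship selection, which this sketch does not attempt.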

Construction of a Japanese Financial Benchmark for Large Language Models

By:	Masanori Hirano
Abstract:	With the recent development of large language models (LLMs), the need for models that focus on particular domains and languages has been discussed. There is also a growing need for benchmarks to evaluate the performance of current LLMs in each domain. Therefore, in this study, we constructed a benchmark comprising multiple tasks specific to the Japanese and financial domains and performed benchmark measurements on some models. Consequently, we confirmed that GPT-4 is currently outstanding, and that the constructed benchmarks function effectively. According to our analysis, our benchmark can differentiate benchmark scores among models in all performance ranges by combining tasks with different difficulties.
Date:	2024–03
URL:	http://d.repec.org/n?u=RePEc:arx:papers:2403.15062&r=big

Forecasting with Neuro-Dynamic Programming

By:	Pedro Afonso Fernandes
Abstract:	Economic forecasting is concerned with the estimation of some variable like gross domestic product (GDP) in the next period given a set of variables that describes the current situation or state of the economy, including industrial production, retail trade turnover or economic confidence. Neuro-dynamic programming (NDP) provides tools to deal with forecasting and other sequential problems with such high-dimensional state spaces. Whereas conventional forecasting methods penalise the difference (or loss) between predicted and actual outcomes, NDP favours the difference between temporally successive predictions, following an interactive and trial-and-error approach. Past data provide guidance to train the models, but in a different way from ordinary least squares (OLS) and other supervised learning methods, signalling the adjustment costs between sequential states. We found that it is possible to train a GDP forecasting model with data from other countries that performs better than models trained with past data from the tested country (Portugal). In addition, we found that non-linear architectures to approximate the value function of a sequential problem, namely neural networks, can perform better than a simple linear architecture, lowering the out-of-sample mean absolute forecast error (MAE) by 32% relative to an OLS model.
Date:	2024–04
URL:	http://d.repec.org/n?u=RePEc:arx:papers:2404.03737&r=big
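
The distinction drawn in the abstract above, between penalising the gap to the actual outcome and the gap between temporally successive predictions, is the core of temporal-difference learning. The sketch below shows one such update pass for a linear predictor. It is only an illustration under stated assumptions, not the paper's model: the names `predict` and `td_pass`, the linear form, and the learning rate are hypothetical.

```python
def predict(weights, x):
    """Linear prediction w . x."""
    return sum(w * xi for w, xi in zip(weights, x))


def td_pass(weights, states, outcome, alpha=0.1):
    """One temporal-difference pass over a sequence of states.

    Each prediction is pulled toward the NEXT prediction in the sequence;
    only the last prediction is pulled toward the realised outcome.
    """
    w = list(weights)
    for t, x in enumerate(states):
        if t + 1 < len(states):
            target = predict(w, states[t + 1])  # successive prediction
        else:
            target = outcome  # realised value at the end of the sequence
        error = target - predict(w, x)
        # Gradient step for a linear predictor: move w along x.
        w = [wi + alpha * error * xi for wi, xi in zip(w, x)]
    return w


# Toy sequence of two states with a realised outcome of 2.0 (made-up data).
states = [[1.0, 0.0], [0.0, 1.0]]
w1 = td_pass([0.0, 0.0], states, outcome=2.0, alpha=0.5)
w2 = td_pass(w1, states, outcome=2.0, alpha=0.5)
```

In the toy run, the first pass only moves the weight tied to the final state; the second pass then propagates that information back to the earlier state through the successive-prediction target. A supervised method such as OLS would instead compare every prediction directly with the realised outcome.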

The anatomy of Chinese innovation: Insights on patent quality and ownership

By:	Boeing, Philipp; Brandt, Loren; Dai, Ruochen; Lim, Kevin; Peters, Bettina
Abstract:	We study the evolution of patenting in China from 1985-2019. We use a Large Language Model to measure patent importance based on patent abstracts and classify patent ownership using a comprehensive business registry. We highlight four insights. First, average patent importance declined from 2000-2010 but has increased more recently. Second, private Chinese firms account for most of patenting growth whereas overseas patentees have played a diminishing role. Third, patentees have greatly reduced their dependence on foreign knowledge. Finally, Chinese and foreign patenting have become more similar in technological composition, but differences persist within technology classes as revealed by abstract similarities.
Keywords:	China, innovation, patents, large language model
JEL:	O30
Date:	2024
URL:	http://d.repec.org/n?u=RePEc:zbw:zewdip:289451&r=big

This nep-big issue is ©2024 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.

General information on the NEP project can be found at https://nep.repec.org. For comments please write to the director of NEP, Marco Novarese at <director@nep.repec.org>. Put “NEP” in the subject, otherwise your mail may be rejected.

NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.