nep-big New Economics Papers
on Big Data
Issue of 2024‒04‒08
thirteen papers chosen by
Tom Coupé, University of Canterbury


  1. An ocean of data: The potential of data on vessel traffic By Annabelle Mourougane; Emmanuelle Guidetti; Graham Pilgrim
  2. Sovereign Risk and Economic Complexity By Gomez-Gonzalez, Jose E.; Uribe, Jorge M.; Valencia, Oscar
  3. Movies By Michalopoulos, S; Rauh, C.
  4. Detecting Anomalous Events in Object-centric Business Processes via Graph Neural Networks By Alessandro Niro; Michael Werner
  5. A machine learning workflow to address credit default prediction By Rambod Rahmani; Marco Parola; Mario G. C. A. Cimino
  6. Financial Default Prediction via Motif-preserving Graph Neural Network with Curriculum Learning By Daixin Wang; Zhiqiang Zhang; Yeyu Zhao; Kai Huang; Yulin Kang; Jun Zhou
  7. Prediction Of Cryptocurrency Prices Using LSTM, SVM And Polynomial Regression By Novan Fauzi Al Giffary; Feri Sulianta
  8. Blockchain Metrics and Indicators in Cryptocurrency Trading By Juan C. King; Roberto Dale; José M. Amigó
  9. Regularized DeepIV with Model Selection By Zihao Li; Hui Lan; Vasilis Syrgkanis; Mengdi Wang; Masatoshi Uehara
  10. Latent Dirichlet Allocation for structured insurance data By Jamotton, Charlotte; Hainaut, Donatien
  11. Applying News and Media Sentiment Analysis for Generating Forex Trading Signals By Oluwafemi F Olaiyapo
  12. Screen for collusive behavior: A machine learning approach By Bantle, Melissa
  13. Enhancing Price Prediction in Cryptocurrency Using Transformer Neural Network and Technical Indicators By Mohammad Ali Labbaf Khaniki; Mohammad Manthouri

  1. By: Annabelle Mourougane; Emmanuelle Guidetti; Graham Pilgrim
    Abstract: Rising uncertainties and geopolitical tensions, together with increasingly complex trade relations, have increased the demand for timely monitoring of global trade. Although it was primarily designed to ensure vessel safety, information from the Automatic Identification System (AIS), which allows vessels to be tracked across the globe, is particularly well suited to providing insights into port activity and maritime trade, which accounts for a large share of global trade. Data are available in quasi real time but need to be pre-processed and validated. This paper contributes to existing research in this field in two major ways. First, it proposes a new methodology to identify ports at a higher level of granularity than in past research. Second, it builds indicators to monitor port congestion and trends in maritime trade flows and provides more granular information to better understand those flows. These indicators will still need to be refined, by complementing the AIS database with additional data sources, but they already provide a useful source of information for monitoring trade at the country and global levels.
    Keywords: big data, maritime trade, port activity, port congestion
    JEL: C55 F17 C81
    Date: 2024–03–19
    URL: http://d.repec.org/n?u=RePEc:oec:stdaaa:2024/02-en&r=big
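    A minimal sketch of the kind of port-activity indicator the abstract describes: deriving port-call counts from AIS-style position reports. The `Ping` record, the speed threshold, and the precomputed geofence flag are all illustrative assumptions, not the paper's actual methodology.

```python
# Sketch: deriving port-call counts from AIS-style position reports.
# All names and thresholds here are illustrative, not the paper's method.
from dataclasses import dataclass

@dataclass
class Ping:
    vessel_id: str
    hour: int           # report time, in hours
    speed_knots: float  # speed over ground
    near_port: bool     # within a port's geofence (assumed precomputed)

def count_port_calls(pings, max_speed=1.0):
    """Count distinct port calls per vessel: a call starts when a vessel
    first reports as stationary inside a geofence, and ends when it leaves."""
    calls = {}
    in_call = {}
    for p in sorted(pings, key=lambda p: (p.vessel_id, p.hour)):
        stationary = p.near_port and p.speed_knots <= max_speed
        if stationary and not in_call.get(p.vessel_id, False):
            calls[p.vessel_id] = calls.get(p.vessel_id, 0) + 1
        in_call[p.vessel_id] = stationary
    return calls

pings = [
    Ping("A", 0, 12.0, False), Ping("A", 1, 0.4, True), Ping("A", 2, 0.2, True),
    Ping("A", 3, 10.0, False), Ping("A", 4, 0.3, True),   # second call
    Ping("B", 0, 14.0, False),
]
print(count_port_calls(pings))  # {'A': 2}
```

    Aggregating such calls over time and across ports is one way congestion and trade-flow indicators could then be built.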
  2. By: Gomez-Gonzalez, Jose E.; Uribe, Jorge M.; Valencia, Oscar
    Abstract: This paper investigates how a country's economic complexity influences its sovereign yield spread with respect to the United States. Notably, a one-unit increase in the Economic Complexity Index is associated with a reduction of about 87 basis points in the 10-year yield spread, while the effect is largely insignificant for maturities under three years. This suggests that economic complexity affects not only the level of sovereign yield spreads but also the slope of the curve. The first set of models uses advanced causal machine learning tools, while the second focuses on economic complexity's predictive power: economic complexity ranks among the top three predictors, alongside inflation and institutional factors such as the rule of law. The paper also discusses the potential mechanisms through which economic complexity reduces sovereign risk, emphasizing its role as a long-run determinant of productivity, output, and income stability, and its influence on the likelihood of fiscal crises.
    Keywords: convenience yields;double-machine-learning;government debt;sovereign credit risk;XGBoost;yield curve
    JEL: F34 G12 G15 H63 O40
    Date: 2024–01
    URL: http://d.repec.org/n?u=RePEc:idb:brikps:13393&r=big
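    The keywords mention double machine learning. Its core partialling-out idea can be sketched in a few lines: regress the outcome (the spread) and the treatment (complexity) on controls, then regress residual on residual. Plain one-dimensional OLS stands in here for the flexible learners (e.g. XGBoost) a real DML pipeline would use, and the data are made up.

```python
# Sketch of the partialling-out idea behind double machine learning.
# 1-D OLS stands in for flexible learners; data are illustrative.
def ols_slope(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)

def residuals(x, y):
    """Residuals of y after a simple linear fit on x."""
    b = ols_slope(x, y)
    a = sum(y) / len(y) - b * sum(x) / len(x)
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

controls   = [1.0, 2.0, 3.0, 4.0, 5.0]   # e.g. inflation
complexity = [0.5, 1.0, 1.4, 2.1, 2.5]
spread     = [5.0, 4.1, 3.2, 1.9, 1.0]   # falls as complexity rises

# Effect of complexity on the spread, net of the controls.
theta = ols_slope(residuals(controls, complexity), residuals(controls, spread))
print(theta)  # negative: higher complexity goes with lower spreads
```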
  3. By: Michalopoulos, S; Rauh, C.
    Abstract: Why are certain movies more successful in some markets than others? Are the entertainment products we consume reflective of our core values and beliefs? These questions drive our investigation into the relationship between a society’s oral tradition and the financial success of films. We combine a unique catalog of local tales, myths, and legends around the world with data on international movie screenings and revenues. First, we quantify the similarity between movies’ plots and traditional motifs employing machine learning techniques. Comparing the same movie across different markets, we establish that films that resonate more with local folklore systematically accrue higher revenue and are more likely to be screened. Second, we document analogous patterns within the US. Google Trends data reveal a pronounced interest in markets where ancestral narratives align more closely with a movie’s theme. Third, we delve into the explicit values transmitted by films, concentrating on the depiction of risk and gender roles. Films that promote risk-taking sell more in entrepreneurial societies today, rooted in traditions where characters pursue dangerous tasks successfully. Films portraying women in stereotypical roles continue to find a robust audience in societies with similar gender stereotypes in their folklore and where women today continue to be relegated to subordinate positions. These findings underscore the enduring influence of traditional storytelling on entertainment patterns in the 21st century, highlighting a profound connection between movie consumption and deeply ingrained cultural narratives and values.
    Keywords: Movies, Folklore, Culture, Values, Entertainment, Text Analysis, Media
    JEL: N00 O10 P00 Z10 Z11
    Date: 2024–03–11
    URL: http://d.repec.org/n?u=RePEc:cam:camdae:2412&r=big
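    One simple stand-in for the plot-to-motif similarity step is bag-of-words cosine similarity. The texts and the whitespace tokenizer below are illustrative; the paper's actual text-embedding method may well differ.

```python
# Sketch: bag-of-words cosine similarity between a movie plot and a folklore
# motif. Texts and tokenization are illustrative, not the paper's pipeline.
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

plot  = "a young hero leaves home to slay a dragon and win the kingdom"
motif = "the hero must slay a dragon to save the kingdom"
print(round(cosine_similarity(plot, motif), 2))  # between 0 and 1
```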
  4. By: Alessandro Niro; Michael Werner
    Abstract: Detecting anomalies is important for identifying inefficiencies, errors, or fraud in business processes. Traditional process mining approaches focus on analyzing 'flattened', sequential event logs based on a single case notion. However, many real-world process executions exhibit a graph-like structure, where events can be associated with multiple cases. Flattening event logs requires selecting a single case identifier, which creates a gap with the real event data and artificially introduces anomalies into the event logs. Object-centric process mining avoids these limitations by allowing events to be related to different cases. This study proposes a novel framework for anomaly detection in business processes that exploits graph neural networks and the enhanced information offered by object-centric process mining. We first reconstruct and represent the process dependencies of the object-centric event logs as attributed graphs and then employ a graph convolutional autoencoder architecture to detect anomalous events. Our results show that the approach achieves promising performance in detecting anomalies at the activity-type and attribute level, although it struggles to detect anomalies in the temporal ordering of events.
    Date: 2024–02
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2403.00775&r=big
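    The scoring rule behind autoencoder-based detection can be shown in isolation: events whose attribute vectors are reconstructed worst are flagged as anomalous. Below, a trivial "autoencoder" that reconstructs every vector as the column-wise mean stands in for the paper's graph convolutional autoencoder; everything is illustrative.

```python
# Sketch of reconstruction-error anomaly scoring. A column-mean
# "reconstruction" stands in for a trained graph autoencoder.
def reconstruction_errors(vectors):
    dims = len(vectors[0])
    mean = [sum(v[d] for v in vectors) / len(vectors) for d in range(dims)]
    # Squared reconstruction error per event.
    return [sum((v[d] - mean[d]) ** 2 for d in range(dims)) for v in vectors]

events = [[1.0, 0.0], [1.1, 0.1], [0.9, 0.0], [5.0, 3.0]]  # last is anomalous
errors = reconstruction_errors(events)
print(errors.index(max(errors)))  # 3: the most anomalous event
```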
  5. By: Rambod Rahmani; Marco Parola; Mario G. C. A. Cimino
    Abstract: Due to the recent increase in interest in Financial Technology (FinTech), applications like credit default prediction (CDP) are gaining significant industrial and academic attention. CDP plays a crucial role in assessing the creditworthiness of individuals and businesses, enabling lenders to make informed decisions regarding loan approvals and risk management. In this paper, we propose a workflow-based approach to improve CDP, the task of assessing the probability that a borrower will default on his or her credit obligations. The workflow consists of multiple steps, each designed to leverage the strengths of different techniques featured in machine learning pipelines and thus best solve the CDP task. We employ a comprehensive and systematic approach, starting with data preprocessing using Weight of Evidence encoding, a technique that handles data scaling in a single shot by removing outliers, dealing with missing values, and making data uniform for models working with different data types. Next, we train several families of learning models, introducing ensemble techniques to build more robust models and hyperparameter optimization via multi-objective genetic algorithms to consider both predictive accuracy and financial aspects. Our research aims to contribute to the FinTech industry by providing a tool for moving toward more accurate and reliable credit risk assessment, benefiting both lenders and borrowers.
    Date: 2024–03
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2403.03785&r=big
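    Weight of Evidence encoding itself is compact enough to sketch: each category c of a feature is mapped to WoE(c) = ln(P(c | non-default) / P(c | default)), putting all categories on one model-friendly numeric scale. The data and category names below are made up, and real pipelines also smooth zero counts, which this sketch omits.

```python
# Sketch of Weight of Evidence (WoE) encoding for one categorical feature.
# Illustrative data; zero-count smoothing is omitted for brevity.
import math

def woe_encode(categories, defaults):
    """Return a {category: WoE} map. `defaults` holds 1 for default, 0 otherwise."""
    goods = sum(1 for d in defaults if d == 0)
    bads = sum(1 for d in defaults if d == 1)
    table = {}
    for c, d in zip(categories, defaults):
        g, b = table.get(c, (0, 0))
        table[c] = (g + (d == 0), b + (d == 1))
    return {c: math.log((g / goods) / (b / bads)) for c, (g, b) in table.items()}

cats  = ["rent", "rent", "own", "own", "own", "rent"]
flags = [1,      1,      0,     0,     1,     0]
woe = woe_encode(cats, flags)
print(woe)  # "own" positive (safer), "rent" negative (riskier)
```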
  6. By: Daixin Wang; Zhiqiang Zhang; Yeyu Zhao; Kai Huang; Yulin Kang; Jun Zhou
    Abstract: User financial default prediction plays a critical role in credit risk forecasting and management. It aims to predict the probability that a user will fail to make repayments in the future. Previous methods mainly extract a set of individual features describing a user's own profile and behavior and build a binary classification model to make default predictions. However, these methods cannot achieve satisfactory results, especially for users with limited information. Although recent work suggests that default prediction can be improved using social relations, it fails to capture the higher-order topological structure at the level of small subgraph patterns. In this paper, we fill this gap by proposing a motif-preserving Graph Neural Network with curriculum learning (MotifGNN) to jointly learn lower-order structures from the original graph and higher-order structures from multi-view motif-based graphs for financial default prediction. Specifically, to address the weak connectivity of motif-based graphs, we design a motif-based gating mechanism that uses the information learned from the well-connected original graph to strengthen the learning of higher-order structure. And because the motif patterns of different samples are highly unbalanced, we propose a curriculum learning mechanism over the whole learning process that focuses more on samples with uncommon motif distributions. Extensive experiments on one public dataset and two industrial datasets demonstrate the effectiveness of the proposed method.
    Date: 2024–03
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2403.06482&r=big
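    A concrete instance of the "higher-order structure" the abstract refers to is the triangle motif: counting how many triangles each node participates in captures a small subgraph pattern that plain neighbor aggregation misses. The toy graph below is illustrative; the paper builds full multi-view motif-based graphs.

```python
# Sketch: counting triangle motifs per node in an undirected graph.
# Illustrative graph; the paper's motif machinery is far richer.
from itertools import combinations

def triangle_counts(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    counts = {n: 0 for n in adj}
    for n, nbrs in adj.items():
        # A triangle through n exists for every pair of adjacent neighbors.
        for a, b in combinations(sorted(nbrs), 2):
            if b in adj[a]:
                counts[n] += 1
    return counts

edges = [("u1", "u2"), ("u2", "u3"), ("u1", "u3"), ("u3", "u4")]
print(triangle_counts(edges))  # u1, u2, u3 share one triangle; u4 none
```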
  7. By: Novan Fauzi Al Giffary; Feri Sulianta
    Abstract: The rapid development of information technology, especially the Internet, has given users a quick and easy way to seek information. With the convenience offered by internet services, many individuals who initially invested in gold and precious metals are now shifting to digital investments in the form of cryptocurrencies. However, investments in crypto coins are subject to uncertainty and fluctuate on a daily basis, posing significant challenges for investors and potentially resulting in substantial losses. The uncertainty of the value of these crypto coins is a critical issue in the field of coin investment. Forecasting is one of the methods used to predict their future value. By utilizing models based on the Long Short-Term Memory, Support Vector Machine, and Polynomial Regression algorithms for forecasting, a performance comparison is conducted to determine which model is most suitable for predicting cryptocurrency prices, with the mean square error employed as the benchmark. Of the three constructed models, the Support Vector Machine with a linear kernel produces the smallest mean square error, 0.02, compared to the Long Short-Term Memory and Polynomial Regression models.
    Keywords: Cryptocurrency, Forecasting, Long Short Term Memory, Mean Square Error, Polynomial Regression, Support Vector Machine
    Date: 2024–03
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2403.03410&r=big
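    The evaluation step the abstract relies on, ranking forecasting models by mean squared error, can be sketched on its own. The two "models" below are toy baselines (persistence and a 3-step moving average), not the paper's LSTM, SVM, or polynomial regression, and the price series is made up.

```python
# Sketch: comparing one-step-ahead forecasters by mean squared error (MSE).
# Toy baselines and data; only the evaluation logic is the point.
def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

prices = [100.0, 102.0, 101.0, 105.0, 107.0, 106.0, 110.0]
# One-step-ahead forecasts for prices[3:].
persistence = prices[2:-1]                                    # repeat last value
moving_avg  = [sum(prices[i - 3:i]) / 3 for i in range(3, len(prices))]

actual = prices[3:]
scores = {"persistence": mse(actual, persistence),
          "moving_average": mse(actual, moving_avg)}
best = min(scores, key=scores.get)
print(scores, "->", best)
```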
  8. By: Juan C. King; Roberto Dale; Jos\'e M. Amig\'o
    Abstract: The objective of this paper is the construction of new indicators that can be useful for operating in the cryptocurrency market. These indicators are based on public data obtained from the blockchain network, specifically from the nodes that make up Bitcoin mining; our analysis is therefore unique to that network. The results obtained with numerical simulations of algorithmic trading and with prediction via statistical models and machine learning demonstrate the importance of variables such as the hash rate, the mining difficulty, and the cost per transaction when it comes to trading Bitcoin or predicting the direction of its price. Variables obtained from the blockchain network are called blockchain metrics here. The corresponding indicators (inspired by the "Hash Ribbon") perform well in locating buy signals. From our results, we conclude that such blockchain indicators provide information with a statistical advantage in the highly volatile cryptocurrency market.
    Date: 2024–02
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2403.00770&r=big
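    The "Hash Ribbon" family of indicators boils down to moving-average crossovers on the network hash rate: a buy signal when a short moving average crosses back above a longer one after a miner-capitulation dip. The windows (3 and 5) and the hash-rate series below are illustrative, not the paper's calibration.

```python
# Sketch of a "Hash Ribbon"-style crossover signal on the hash rate.
# Windows and data are illustrative assumptions.
def sma(series, window, i):
    """Simple moving average of the `window` values ending at index i."""
    return sum(series[i - window + 1 : i + 1]) / window

def crossover_signals(hash_rate, short=3, long=5):
    """Indices where the short SMA crosses from below to above the long SMA."""
    signals = []
    for i in range(long, len(hash_rate)):
        prev_diff = sma(hash_rate, short, i - 1) - sma(hash_rate, long, i - 1)
        diff = sma(hash_rate, short, i) - sma(hash_rate, long, i)
        if prev_diff <= 0 < diff:
            signals.append(i)
    return signals

# Hash rate dips (miner capitulation) then recovers.
hr = [100, 98, 95, 90, 88, 87, 90, 95, 100, 104]
print(crossover_signals(hr))  # [7]: buy signal as the recovery sets in
```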
  9. By: Zihao Li; Hui Lan; Vasilis Syrgkanis; Mengdi Wang; Masatoshi Uehara
    Abstract: In this paper, we study nonparametric estimation of instrumental variable (IV) regressions. While recent advancements in machine learning have introduced flexible methods for IV estimation, they often encounter one or more of the following limitations: (1) restricting the IV regression to be uniquely identified; (2) requiring a minimax computation oracle, which is highly unstable in practice; (3) the absence of a model selection procedure. In this paper, we present the first method and analysis that avoid all three limitations while still enabling general function approximation. Specifically, we propose a minimax-oracle-free method called Regularized DeepIV (RDIV) regression that converges to the least-norm IV solution. Our method consists of two stages: we first learn the conditional distribution of covariates, and using the learned distribution, we learn the estimator by minimizing a Tikhonov-regularized loss function. We further show that our method allows model selection procedures that achieve the oracle rates in the misspecified regime. When extended to an iterative estimator, our method matches the current state-of-the-art convergence rate. Our method is a Tikhonov-regularized variant of the popular DeepIV method with a nonparametric MLE first-stage estimator, and our results provide the first rigorous guarantees for this empirically used method, showcasing the importance of regularization, which was absent from the original work.
    Date: 2024–03
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2403.04236&r=big
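    Tikhonov regularization in its simplest form is ridge regression: minimize sum_i (y_i - b x_i)^2 + lam * b^2, which in one dimension has the closed form b = <x, y> / (<x, x> + lam). The IV and neural-network machinery of the paper is far richer; this only illustrates the regularized loss, on made-up data.

```python
# Sketch of Tikhonov (ridge) regularization in one dimension.
# Closed form: b = <x,y> / (<x,x> + lam); data are illustrative.
def ridge_1d(x, y, lam):
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + lam)

x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]     # roughly y = 2x
print(ridge_1d(x, y, 0.0))   # ordinary least squares
print(ridge_1d(x, y, 10.0))  # shrunk toward zero by the penalty
```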
  10. By: Jamotton, Charlotte (Université catholique de Louvain, LIDAM/ISBA, Belgium); Hainaut, Donatien (Université catholique de Louvain, LIDAM/ISBA, Belgium)
    Abstract: This article explores the application of Latent Dirichlet Allocation (LDA) to structured tabular insurance data. LDA is a probabilistic topic modelling approach initially developed in Natural Language Processing (NLP) to uncover the underlying structure of (unstructured) textual data. It was designed to represent textual documents as mixtures of latent (hidden) topics, and topics as mixtures of words. This study introduces the LDA document-topic distribution as a soft clustering tool for unsupervised learning tasks in the actuarial field. By defining each topic as a risk profile, and by treating insurance policies as documents and the modalities of categorical covariates as words, we show how LDA can be extended beyond textual data and can offer a framework for uncovering underlying structures within insurance portfolios. Our experimental results and analysis highlight how modelling policies based on topic cluster membership, and identifying the dominant modalities within each risk profile, can give insights into the prominent risk factors contributing to higher or lower claim frequencies.
    Keywords: Latent Dirichlet Allocation ; topic modelling ; soft clustering ; insurance data ; risk profile ; natural language processing
    Date: 2024–03–08
    URL: http://d.repec.org/n?u=RePEc:aiz:louvad:2024008&r=big
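    The key framing step, turning a policy (a row of categorical covariates) into the "document" of "words" that LDA expects, can be sketched directly. Field names and values below are illustrative; encoding each modality as field=value keeps identical values of different fields distinct.

```python
# Sketch: representing insurance policies as LDA-style documents whose
# words are the modalities of categorical covariates. Illustrative data.
from collections import Counter

def policy_to_document(policy):
    """Encode each modality as field=value so values of different fields
    stay distinct tokens."""
    return [f"{field}={value}" for field, value in sorted(policy.items())]

portfolio = [
    {"vehicle": "suv", "region": "urban", "driver_age": "18-25"},
    {"vehicle": "suv", "region": "rural", "driver_age": "40-60"},
    {"vehicle": "compact", "region": "urban", "driver_age": "18-25"},
]
docs = [policy_to_document(p) for p in portfolio]
vocab = Counter(token for doc in docs for token in doc)
print(docs[0])            # one policy as a "document"
print(len(vocab))         # vocabulary of modality tokens
```

    A fitted LDA model would then assign each such document a distribution over topics, which the article interprets as soft membership in risk profiles.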
  11. By: Oluwafemi F Olaiyapo
    Abstract: The objective of this research is to examine how sentiment analysis can be employed to generate trading signals for the Foreign Exchange (Forex) market. The author assessed sentiment in social media posts and news articles pertaining to the United States Dollar (USD) using a combination of methods: lexicon-based analysis and the Naive Bayes machine learning algorithm. The findings indicate that sentiment analysis proves valuable in forecasting market movements and devising trading signals. Notably, its effectiveness is consistent across different market conditions. The author concludes that by analyzing sentiment expressed in news and social media, traders can glean insights into prevailing market sentiments towards the USD and other pertinent countries, thereby aiding trading decision-making. This study underscores the importance of weaving sentiment analysis into trading strategies as a pivotal tool for predicting market dynamics.
    Date: 2024–02
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2403.00785&r=big
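    The lexicon-based half of the pipeline is easy to sketch: score a headline by counting words from small positive and negative lexicons, then map the net score to a trading signal. The lexicons, headlines, and signal names below are illustrative; the study also trains a Naive Bayes classifier, which is omitted here.

```python
# Sketch of lexicon-based sentiment scoring mapped to a Forex signal.
# Lexicons, headlines, and signal labels are illustrative assumptions.
POSITIVE = {"rally", "strong", "gains", "beat", "growth"}
NEGATIVE = {"slump", "weak", "losses", "miss", "recession"}

def sentiment_signal(text):
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "long_usd"
    if score < 0:
        return "short_usd"
    return "neutral"

print(sentiment_signal("Dollar gains on strong jobs data"))    # long_usd
print(sentiment_signal("Recession fears drive dollar losses")) # short_usd
```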
  12. By: Bantle, Melissa
    Abstract: This paper uses a machine learning technique to build a screen for collusive behavior. Such tools can be applied by competition authorities, but also by companies to screen the behavior of their suppliers. The method is applied to the German retail gasoline market to detect anomalous behavior in the price setting of filling stations. To this end, the algorithm identifies anomalies in the data-generating process. The results show that various anomalies can be detected with this method. These anomalies in price-setting behavior are then discussed with respect to their implications for the competitiveness of the market.
    Keywords: Machine Learning, Cartel Screens, Fuel Retail Market
    JEL: C53 K21 L44
    Date: 2024
    URL: http://d.repec.org/n?u=RePEc:zbw:hohdps:285380&r=big
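    One elementary form of such a screen is a deviation test: flag a station's price as anomalous when it departs from its own history by more than k standard deviations. The paper's machine learning screen is more sophisticated; the threshold and price series below are illustrative.

```python
# Sketch of a simple z-score price screen for anomalous station behavior.
# Threshold k and prices are illustrative assumptions.
import statistics

def flag_anomalies(prices, k=2.0):
    """Indices of prices more than k population standard deviations from the mean."""
    mu = statistics.mean(prices)
    sigma = statistics.pstdev(prices)
    return [i for i, p in enumerate(prices) if sigma and abs(p - mu) > k * sigma]

station_prices = [1.79, 1.81, 1.80, 1.78, 1.80, 2.15, 1.79, 1.80]
print(flag_anomalies(station_prices))  # flags the day the price jumped
```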
  13. By: Mohammad Ali Labbaf Khaniki; Mohammad Manthouri
    Abstract: This study presents an innovative approach for predicting cryptocurrency time series, specifically focusing on Bitcoin, Ethereum, and Litecoin. The methodology integrates technical indicators, a Performer neural network, and a BiLSTM (Bidirectional Long Short-Term Memory) to capture temporal dynamics and extract significant features from raw cryptocurrency data. The application of technical indicators facilitates the extraction of intricate patterns, momentum, volatility, and trends. The Performer neural network, employing Fast Attention Via positive Orthogonal Random features (FAVOR+), has demonstrated superior computational efficiency and scalability compared to the traditional multi-head attention mechanism in Transformer models. Additionally, integrating a BiLSTM into the feedforward network enhances the model's capacity to capture temporal dynamics, processing the data in both the forward and backward directions. This is particularly advantageous for time series, where past and future data points can influence the current state. The proposed method has been applied to the hourly and daily timeframes of the major cryptocurrencies, and its performance has been benchmarked against other methods documented in the literature. The results underscore the potential of the proposed method to outperform existing models, marking a significant progression in the field of cryptocurrency price prediction.
    Date: 2024–03
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2403.03606&r=big
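    The feature-engineering step can be illustrated with one classic technical indicator, the Relative Strength Index (RSI), computed from closing prices. The price series below are made up, and the paper presumably feeds several such indicators into its network; this shows only the basic (non-smoothed) RSI over the first window.

```python
# Sketch: computing the 14-period RSI from closing prices.
# Basic average-gain/average-loss form over the first window; illustrative data.
def rsi(closes, period=14):
    gains, losses = [], []
    for prev, cur in zip(closes, closes[1:]):
        change = cur - prev
        gains.append(max(change, 0.0))
        losses.append(max(-change, 0.0))
    avg_gain = sum(gains[:period]) / period
    avg_loss = sum(losses[:period]) / period
    if avg_loss == 0:
        return 100.0               # pure uptrend saturates the index
    rs = avg_gain / avg_loss
    return 100.0 - 100.0 / (1.0 + rs)

closes = [float(100 + i) for i in range(15)]  # 14 straight gains
print(rsi(closes))                            # 100.0
mixed = [100.0, 101.0] * 8                    # alternating up/down moves
print(rsi(mixed))                             # 50.0: balanced gains and losses
```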

This nep-big issue is ©2024 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at https://nep.repec.org. For comments please write to the director of NEP, Marco Novarese at <director@nep.repec.org>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.