|
on Big Data |
By: | Mueller, H.; Rauh, C.; Seimon, B. |
Abstract: | This article provides a structured description of openly available news topics and forecasts for armed conflict at the national and grid cell level starting January 2010. The news topics as well as the forecasts are updated monthly at conflictforecast.org and provide coverage for more than 170 countries and about 65,000 grid cells of size 55x55km worldwide. The forecasts rely on Natural Language Processing (NLP) and machine learning techniques to leverage a large corpus of newspaper text for predicting sudden onsets of violence in peaceful countries. Our goals are to: a) support conflict prevention efforts by making our risk forecasts available to practitioners and research teams worldwide, b) facilitate additional research that can utilise risk forecasts for causal identification, and c) provide an overview of the news landscape. |
Keywords: | Civil War, Conflict, Forecasting, Machine Learning, News Topics, Random Forest, Topic Models |
Date: | 2024–02–02 |
URL: | http://d.repec.org/n?u=RePEc:cam:camjip:2402&r=big |
By: | Maheronnaghsh, Mohammad Javad; Gheidi, Mohammad Mahdi; Fazli, MohammadAmin |
Abstract: | In the dynamic world of financial markets, accurate price predictions are essential for informed decision-making. This research proposal outlines a comprehensive study aimed at forecasting stock and currency prices using state-of-the-art Machine Learning (ML) techniques. By delving into the intricacies of models such as Transformers, LSTM, Simple RNN, NHits, and NBeats, we seek to contribute to the realm of financial forecasting, offering valuable insights for investors, financial analysts, and researchers. This article provides an in-depth overview of our methodology, data collection process, model implementations, evaluation metrics, and potential applications of our research findings. The research indicates that NBeats and NHits models exhibit superior performance in financial forecasting tasks, especially with limited data, while Transformers require more data to reach full potential. Our findings offer insights into the strengths of different ML techniques for financial prediction, highlighting specialized models like NBeats and NHits as top performers - thus informing model selection for real-world applications. To enhance readability, all acronyms used in the paper are defined below: ML: Machine Learning; LSTM: Long Short-Term Memory; RNN: Recurrent Neural Network; NHits: Neural Hierarchical Interpolation for Time Series Forecasting; NBeats: Neural Basis Expansion Analysis for Time Series; ARIMA: Autoregressive Integrated Moving Average; GARCH: Generalized Autoregressive Conditional Heteroskedasticity; SVMs: Support Vector Machines; CNNs: Convolutional Neural Networks; MSE: Mean Squared Error; MAE: Mean Absolute Error; RMSE: Root Mean Squared Error; API: Application Programming Interface; F1-score: F1 Score; GRU: Gated Recurrent Unit; yfinance: Yahoo Finance (a Python library for fetching financial data) |
Date: | 2023–09–30 |
URL: | http://d.repec.org/n?u=RePEc:osf:osfxxx:dzp26&r=big |
By: | Moseli Mots'oehli; Anton Nikolaev; Wawan B. IGede; John Lynham; Peter J. Mous; Peter Sadowski |
Abstract: | Fish stock assessment often involves manual fish counting by taxonomy specialists, which is both time-consuming and costly. We propose an automated computer vision system that performs both taxonomic classification and fish size estimation from images taken with a low-cost digital camera. The system first performs object detection and segmentation using a Mask R-CNN to identify individual fish from images containing multiple fish, possibly consisting of different species. Then each fish's species is classified and its length is predicted using separate machine learning models. These models are trained on a dataset of 50,000 hand-annotated images containing 163 different fish species, ranging in length from 10cm to 250cm. Evaluated on held-out test data, our system achieves a 92% intersection over union on the fish segmentation task, an 89% top-1 classification accuracy on single fish species classification, and a 2.3 cm mean error on the fish length estimation task. |
Date: | 2024–03 |
URL: | http://d.repec.org/n?u=RePEc:arx:papers:2403.10916&r=big |
By: | N. Meltem Daysal (University of Copenhagen, CEBI, CESifo, IZA); Sendhil Mullainathan (University of Chicago Booth School of Business); Ziad Obermeyer (University of California, Berkeley); Suproteem K. Sarkar (Harvard University); Mircea Trandafir (The Rockwool Foundation Research Unit) |
Abstract: | We consider the health effects of “precision” screening policies for cancer guided by algorithms. We show that machine learning models that predict breast cancer from health claims data outperform models based on just age and established risk factors. We estimate that screening women with high predicted risk of invasive tumors would reduce the long-run incidence of later-stage tumors by 40%. Screening high-risk women would also lead to half the rate of cancer overdiagnosis that screening low-risk women would. We show that these results depend crucially on the machine learning model’s prediction target. A model trained to predict positive mammography results leads to policies with weaker health effects and higher rates of overdiagnosis than a model trained to predict invasive tumors. |
Keywords: | breast cancer, precision screening, predictive modeling, machine learning, health policy |
JEL: | I12 I18 J16 C55 |
Date: | 2022–12–17 |
URL: | http://d.repec.org/n?u=RePEc:kud:kucebi:2224&r=big |
By: | Jonathan Fuhr (School of Business and Economics, University of Tübingen); Philipp Berens (Hertie Institute for AI in Brain Health, University of Tübingen); Dominik Papies (School of Business and Economics, University of Tübingen) |
Abstract: | The estimation of causal effects with observational data continues to be a very active research area. In recent years, researchers have developed new frameworks which use machine learning to relax classical assumptions necessary for the estimation of causal effects. In this paper, we review one of the most prominent methods - "double/debiased machine learning" (DML) - and empirically evaluate it by comparing its performance on simulated data relative to more traditional statistical methods, before applying it to real-world data. Our findings indicate that the application of a suitably flexible machine learning algorithm within DML improves the adjustment for various nonlinear confounding relationships. This advantage enables a departure from traditional functional form assumptions typically necessary in causal effect estimation. However, we demonstrate that the method continues to critically depend on standard assumptions about causal structure and identification. When estimating the effects of air pollution on housing prices in our application, we find that DML estimates are consistently larger than estimates of less flexible methods. From our overall results, we provide actionable recommendations for specific choices researchers must make when applying DML in practice. |
Date: | 2024–03 |
URL: | http://d.repec.org/n?u=RePEc:arx:papers:2403.14385&r=big |
By: | Thanos Konstantinidis; Giorgos Iacovides; Mingxue Xu; Tony G. Constantinides; Danilo Mandic |
Abstract: | There are multiple sources of financial news online which influence market movements and traders' decisions. This highlights the need for accurate sentiment analysis, in addition to having appropriate algorithmic trading techniques, to arrive at better informed trading decisions. Standard lexicon-based sentiment approaches have demonstrated their power in aiding financial decisions. However, they are known to suffer from issues related to context sensitivity and word ordering. Large Language Models (LLMs) can also be used in this context, but they are not finance-specific and tend to require significant computational resources. To facilitate a finance-specific LLM framework, we introduce a novel approach based on the Llama 2 7B foundational model, in order to benefit from its generative nature and comprehensive language manipulation. This is achieved by fine-tuning the Llama 2 7B model on a small portion of supervised financial sentiment analysis data, so as to jointly handle the complexities of financial lexicon and context, and further equipping it with a neural network based decision mechanism. Such a generator-classifier scheme, referred to as FinLlama, is trained not only to classify the sentiment valence but also to quantify its strength, thus offering traders a nuanced insight into financial news articles. Complementing this, the implementation of parameter-efficient fine-tuning through LoRA optimises trainable parameters, thus minimising computational and memory requirements, without sacrificing accuracy. Simulation results demonstrate the ability of the proposed FinLlama to provide a framework for enhanced portfolio management decisions and increased market returns. These results underpin the ability of FinLlama to construct high-return portfolios which exhibit enhanced resilience, even during volatile periods and unpredictable market events. |
Date: | 2024–03 |
URL: | http://d.repec.org/n?u=RePEc:arx:papers:2403.12285&r=big |
By: | Michalopoulos, S; Rauh, C. |
Abstract: | Why are certain movies more successful in some markets than others? Are the entertainment products we consume reflective of our core values and beliefs? These questions drive our investigation into the relationship between a society’s oral tradition and the financial success of films. We combine a unique catalog of local tales, myths, and legends around the world with data on international movie screenings and revenues. First, we quantify the similarity between movies’ plots and traditional motifs employing machine learning techniques. Comparing the same movie across different markets, we establish that films that resonate more with local folklore systematically accrue higher revenue and are more likely to be screened. Second, we document analogous patterns within the US. Google Trends data reveal a pronounced interest in markets where ancestral narratives align more closely with a movie’s theme. Third, we delve into the explicit values transmitted by films, concentrating on the depiction of risk and gender roles. Films that promote risk-taking sell more in entrepreneurial societies today, rooted in traditions where characters pursue dangerous tasks successfully. Films portraying women in stereotypical roles continue to find a robust audience in societies with similar gender stereotypes in their folklore and where women today continue being relegated to subordinate positions. These findings underscore the enduring influence of traditional storytelling on entertainment patterns in the 21st century, highlighting a profound connection between movie consumption and deeply ingrained cultural narratives and values. |
Keywords: | Movies, Folklore, Culture, Values, Entertainment, Text Analysis, Media |
JEL: | N00 O10 P00 Z10 Z11 |
Date: | 2024–03–11 |
URL: | http://d.repec.org/n?u=RePEc:cam:camjip:2406&r=big |
By: | Shaojie Li; Xinqi Dong; Danqing Ma; Bo Dang; Hengyi Zang; Yulu Gong |
Abstract: | Mobile Internet user credit assessment is an important way for communication operators to establish decisions and formulate measures, and it is also a guarantee for operators to obtain expected benefits. However, credit evaluation methods have long been monopolized by financial industries such as banks and credit. As supporters and providers of platform network technology and network resources, communication operators are also builders and maintainers of communication networks. Internet data improves the user's credit evaluation strategy. This paper uses the massive data provided by communication operators to carry out research on the operator's user credit evaluation model based on the fusion LightGBM algorithm. First, for the massive data related to user evaluation provided by operators, key features are extracted by data preprocessing and feature engineering methods, and a multi-dimensional feature set with statistical significance is constructed; then, linear regression, decision tree, LightGBM, and other machine learning algorithms are used to build multiple basic models and find the best basic model; finally, Averaging, Voting, Blending, Stacking, and other ensemble algorithms are integrated to refine multiple fusion models and establish the most suitable fusion model for operator user evaluation. |
Date: | 2024–03 |
URL: | http://d.repec.org/n?u=RePEc:arx:papers:2403.14483&r=big |
By: | Arijit Das; Tanmoy Nandi; Prasanta Saha; Suman Das; Saronyo Mukherjee; Sudip Kumar Naskar; Diganta Saha |
Abstract: | Financial market components such as the prices of stocks, shares, gold, oil, and mutual funds are affected by news and posts on social media. In this work, deep learning based models are proposed to predict the trend of the financial market based on NLP analysis of the Twitter handles of leaders of different fields. Many models are available to predict the financial market based only on the historical data of the financial component, but combining historical data with news and social media posts, such as those on Twitter, is the main objective of the present work. Substantial improvement is shown in the result. The main features of the present work are: a) proposing a completely generalized algorithm which is able to generate models for any Twitter handle and any financial component, b) predicting the time window for a tweet's effect on a stock price, c) analyzing the effect of multiple Twitter handles for predicting the trend. A detailed survey is done to find out the latest work in recent years in the similar field, find the research gap, and collect the required data for analysis and prediction. A state-of-the-art algorithm is proposed and a complete implementation with environment is given. An insightful trend of the result improvement considering the NLP analysis of Twitter data on financial market components is shown. The Indian and USA financial markets are explored in the present work, whereas other markets can be taken up in future. The socio-economic impact of the present work is discussed in the conclusion. |
Date: | 2024–03 |
URL: | http://d.repec.org/n?u=RePEc:arx:papers:2403.12161&r=big |
By: | Marco Letta; Pierluigi Montalbano; Adriana Paolantonio |
Abstract: | The complex relationship between climate shocks, migration, and adaptation hampers a rigorous understanding of the heterogeneous mobility outcomes of farm households exposed to climate risk. To unpack this heterogeneity, the analysis combines longitudinal multi-topic household survey data from Nigeria with a causal machine learning approach, tailored to a conceptual framework bridging economic migration theory and the poverty traps literature. The results show that pre-shock asset levels, in situ adaptive capacity, and cumulative shock exposure drive not just the magnitude but also the sign of the impact of agriculture-relevant weather anomalies on the mobility outcomes of farming households. While local adaptation acts as a substitute for migration, the roles played by wealth constraints and repeated shock exposure suggest the presence of climate-induced immobility traps. |
Date: | 2024–03 |
URL: | http://d.repec.org/n?u=RePEc:arx:papers:2403.09470&r=big |
By: | Leogrande, Angelo |
Abstract: | In the following article I take into consideration the role of knowledge workers in the Italian regions. The analysed data refers to the ISTAT-BES database. The metric analysis consists of an in-depth analysis of the trends of the regions and macro-regions, followed by clustering with the k-Means algorithm, the application of machine learning algorithms for prediction, and the presentation of an econometric model with panel data methods. The results are also critically discussed in light of the North-South divide and the economic policy implications. |
Keywords: | Innovation, Innovation and Invention, Management of Technological Innovation and R&D, Technological Change, Intellectual Property and Intellectual Capital |
JEL: | O3 O30 O31 O32 O33 O34 |
Date: | 2024–03–26 |
URL: | http://d.repec.org/n?u=RePEc:pra:mprapa:120550&r=big |
By: | Oguzhan Cepni (Copenhagen Business School, Department of Economics, Porcelaenshaven 16A, Frederiksberg DK-2000, Denmark; Ostim Technical University, Ankara, Turkiye); Riza Demirer (Department of Economics and Finance, Southern Illinois University Edwardsville, Edwardsville, IL 62026-1102, USA); Rangan Gupta (Department of Economics, University of Pretoria, Private Bag X20, Hatfield 0028, South Africa); Christian Pierdzioch (Department of Economics, Helmut Schmidt University, Holstenhofweg 85, P.O.B. 700822, 22008 Hamburg, Germany) |
Abstract: | This paper extends the literature on the nexus between political geography and financial markets to the stock market volatility context by examining the interrelation between political geography and the predictive relation between the state- and aggregate-level stock market volatility via recently constructed measures of political alignment. Using monthly data for the period from February 1994 to March 2023 and a machine learning technique called random forests, we show that the importance of the state-level realized stock market volatilities as a driver of aggregate stock market volatility displays considerable cross-sectional dispersion as well as substantial variation over time, with the state of New York playing a prominent role. Further analysis shows that stronger political alignment of a state with the ruling party is associated with a lower contribution of the state's realized volatility to aggregate stock market volatility, highlighting the role of risk effects associated with the political geography of firms. Finally, we show that the negative link between the political alignment of a state and the importance of that state's realized volatility over aggregate stock market volatility is statistically significant during high-sentiment periods, but weak and statistically insignificant during low-sentiment periods, underscoring the role of investor sentiment for the nexus between political geography and financial markets. Our findings present new insight into the risk-based arguments that associate political geography with stock market dynamics. |
Keywords: | Stock market volatility, Random forests, Political alignment, Investor sentiment |
JEL: | C22 C23 C51 C53 G10 D81 |
Date: | 2024–03 |
URL: | http://d.repec.org/n?u=RePEc:pre:wpaper:202414&r=big |
By: | Gustavo Romero Cardoso; Marcio Issao Nakane |
Abstract: | The purpose of this paper is to explore the impact of textual data on the Brazilian economic cycle. Utilizing a dataset of articles from the “Valor Econômico” newspaper spanning from July 2011 to December 2022, we employ the topic model Latent Dirichlet Allocation (LDA) to transform this textual data into a series of monthly topic proportions. From this output, we have developed two news indices, each with distinct methodologies but sharing the objective of assessing the influence of news topics on asset prices. We incorporate these indices into a structural VAR model to differentiate between news and noise shocks and to analyze their effects on macroeconomic variables. Our results reveal that news shocks, as captured by the news indices, significantly impact both asset prices and a range of macroeconomic indicators. Both news and noise shocks are found to be crucial in explaining a considerable proportion of the variance in asset prices over short- and long-term periods, underscoring the pivotal role of news information in market dynamics. |
Keywords: | News; textual data; Latent Dirichlet Allocation; Brazilian business cycles |
JEL: | C8 C55 E32 |
Date: | 2024–04–02 |
URL: | http://d.repec.org/n?u=RePEc:spa:wpaper:2024wpecon12&r=big |
By: | Espíndola, Ernesto; Suárez, José Ignacio |
Abstract: | In recent decades, rapid technological progress has generated a growing interest in the transformation of the world of work. This concern is based on the potential of emerging technologies to replace tasks and roles traditionally performed by human beings, either partially or entirely. It is, therefore, essential to examine and understand the social, economic, and ethical implications of this process and seek solutions to harness the benefits associated with the automation of production processes and mitigate possible negative impacts. This paper seeks to estimate the probabilities and risks of job automation and analyse its potential impacts on labour inclusion in Latin America. To this end, this document implements a machine learning-based methodology adapted to the specific characteristics of the region using data from PIAAC surveys and household surveys. In this way, the aim is to build a probability vector of job automation adapted to the region. This vector can be reused in any source of information that contains internationally comparable occupational codes, such as household surveys or employment surveys. The study provides novel estimates of labour automation based on Latin American data and analyses the phenomenon in different aspects of labour inclusion and social stratification. The results show that the risks of automation vary among different social groups, which points to the need to build adapted and efficient policies that address the diverse needs that this process imposes. To this end, the document addresses different policy areas to promote effective labour inclusion in an era of rapid advances in intelligent technologies, ensuring that all individuals can access decent employment and that these inequalities can be addressed effectively. |
Date: | 2024–03–25 |
URL: | http://d.repec.org/n?u=RePEc:ecr:col041:69088&r=big |
By: | Suss, Joel; Kemeny, Tom; Connor, Dylan S. |
Abstract: | Wealth inequality has been sharply rising in the United States and across many other high-income countries. Due to a lack of data, we know little about how this trend has unfolded across locations within countries. Examining the subnational geography of wealth is crucial because, from one generation to the next, it shapes the distribution of opportunity, disadvantage, and power across individuals and communities. By employing machine-learning-based imputation to link national historical surveys conducted by the U.S. Federal Reserve to population survey microdata, the data presented in this article addresses this gap. The Geographic Wealth Inequality Database (“GEOWEALTH-US”) provides the first estimates of the level and distribution of wealth at various geographical scales within the United States from 1960 to 2020. The GEOWEALTH-US database enables new lines of investigation into the contribution of spatial wealth disparities to major societal challenges including wealth concentration, income inequality, social mobility, housing unaffordability, and political polarization. |
JEL: | N0 |
Date: | 2024–02–28 |
URL: | http://d.repec.org/n?u=RePEc:ehl:lserod:122377&r=big |
By: | Jeff Dominitz; Charles F. Manski |
Abstract: | We argue that comprehensive out-of-sample (OOS) evaluation using statistical decision theory (SDT) should replace the current practice of K-fold and Common Task Framework validation in machine learning (ML) research. SDT provides a formal framework for performing comprehensive OOS evaluation across all possible (1) training samples, (2) populations that may generate training data, and (3) populations of prediction interest. Regarding feature (3), we emphasize that SDT requires the practitioner to directly confront the possibility that the future may not look like the past and to account for a possible need to extrapolate from one population to another when building a predictive algorithm. SDT is simple in abstraction, but it is often computationally demanding to implement. We discuss progress in tractable implementation of SDT when prediction accuracy is measured by mean square error or by misclassification rate. We summarize research studying settings in which the training data will be generated from a subpopulation of the population of prediction interest. We also consider conditional prediction with alternative restrictions on the state space of possible populations that may generate training data. We conclude by calling on ML researchers to join with econometricians and statisticians in expanding the domain within which implementation of SDT is tractable. |
Date: | 2024–03 |
URL: | http://d.repec.org/n?u=RePEc:arx:papers:2403.11016&r=big |
By: | Christian Møller Dahl (University of Southern Denmark); Torben Johansen (University of Southern Denmark); Christian Vedel (University of Southern Denmark) |
Abstract: | This paper introduces a new tool, OccCANINE, to automatically transform occupational descriptions into the HISCO classification system. The manual work involved in processing and classifying occupational descriptions is error-prone, tedious, and time-consuming. We finetune a preexisting language model (CANINE) to do this automatically, thereby performing in seconds and minutes what previously took days and weeks. The model is trained on 14 million pairs of occupational descriptions and HISCO codes in 13 different languages contributed by 22 different sources. Our approach is shown to have accuracy, recall, and precision above 90 percent. Our tool breaks the metaphorical HISCO barrier and makes this data readily available for analysis of occupational structures with broad applicability in economics, economic history, and various related disciplines. |
Keywords: | Occupational Standardization, HISCO Classification System, Machine Learning in Economic History, Language Models |
JEL: | C55 C81 J1 N01 N3 N6 O33 |
Date: | 2024–04 |
URL: | http://d.repec.org/n?u=RePEc:hes:wpaper:0255&r=big |
By: | A. Fronzetti Colladon; R. Vestrelli; S. Bait; M. M. Schiraldi |
Abstract: | Various macroeconomic and institutional factors hinder FDI inflows, including corruption, trade openness, access to finance, and political instability. Existing research mostly focuses on country-level data, with limited exploration of firm-level data, especially in developing countries. Recognizing this gap, recent calls for research emphasize the need for qualitative data analysis to delve into FDI determinants, particularly at the regional level. This paper proposes a novel methodology, based on text mining and social network analysis, to get information from more than 167,000 online news articles to quantify regional-level (sub-national) attributes affecting FDI ownership in African companies. Our analysis extends information on obstacles to industrial development as mapped by the World Bank Enterprise Surveys. Findings suggest that regional (sub-national) structural and institutional characteristics can play an important role in determining foreign ownership. |
Date: | 2024–03 |
URL: | http://d.repec.org/n?u=RePEc:arx:papers:2403.10239&r=big |
By: | Bruno de Melo |
Abstract: | Performance attribution analysis, defined as the process of explaining the drivers of the excess performance of an investment portfolio against a benchmark, stands as a significant aspect of portfolio management and plays a crucial role in the investment decision-making process, particularly within the fund management industry. Rooted in a solid financial and mathematical framework, the importance and methodologies of this analytical technique are extensively documented across numerous academic research papers and books. The integration of large language models (LLMs) and AI agents marks a groundbreaking development in this field. These agents are designed to automate and enhance the performance attribution analysis by accurately calculating and analyzing portfolio performances against benchmarks. In this study, we introduce the application of an AI Agent for a variety of essential performance attribution tasks, including the analysis of performance drivers and the use of LLMs as a calculation engine for multi-level attribution analysis and question-answer (QA) exercises. Leveraging advanced prompt engineering techniques such as Chain-of-Thought (CoT) and Plan and Solve (PS), and employing a standard agent framework from LangChain, the research achieves promising results: accuracy rates exceeding 93% in analyzing performance drivers, 100% in multi-level attribution calculations, and more than 84% in QA exercises that simulate official examination standards. These findings affirm the impactful role of AI agents, prompt engineering and evaluation in advancing portfolio management processes, highlighting a significant advancement in the practical application and evaluation of AI technologies within the domain. |
Date: | 2024–03 |
URL: | http://d.repec.org/n?u=RePEc:arx:papers:2403.10482&r=big |
By: | Fabio Gatti |
Abstract: | Medieval and Early-Modern business correspondence between European companies constitutes a rich source of economic, business, and trade information in that the writing of letters was the very instrument through which merchants ordered and organized the shipments of goods, and performed financial operations. While a comprehensive analysis of such material enables scholars to reconstruct the supply chains and sales of various goods, as well as identify the trading networks in Europe, much of the archival sources have not undergone any systematic and quantitative analysis. In this paper we develop a new holistic and quantitative approach for analysing the entire outgoing, and so far unexploited, correspondence of a major Renaissance merchant bank - the Saminiati & Guasconi company of Florence - for the first years of its activity. After digitization of the letters, we employ an AI-based HTR model on the Transkribus platform and perform an automated text analysis over the HTR model’s output. For each letter (6,376 epistles) this results in the identification of the addressee (446 merchants), their place of residence (65 towns), and the traded goods (27 main goods). The approach developed arguably provides a best-practice methodology for the quantitative treatment of medieval and early-modern merchant letters and the use of the derived historical text as data. |
Date: | 2024–03 |
URL: | http://d.repec.org/n?u=RePEc:ube:dpvwib:dp2403&r=big |