on Big Data |
By: | Ziyao Zhou; Ronitt Mehra |
Abstract: | This project introduces an end-to-end trading system that leverages Large Language Models (LLMs) for real-time market sentiment analysis. By synthesizing data from financial news and social media, the system integrates sentiment-driven insights with technical indicators to generate actionable trading signals. FinGPT serves as the primary model for sentiment analysis, ensuring domain-specific accuracy, while Kubernetes is used for scalable and efficient deployment. |
Date: | 2025–02 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2502.01574 |
By: | Muhammad Sukri Bin Ramli |
Abstract: | This study examines the relationship between income inequality, gender, and school completion rates in Malaysia using machine learning techniques. The dataset utilized is from Malaysia's Public Sector Open Data Portal, covering the period 2016-2022. The analysis employs various machine learning techniques, including K-means clustering, ARIMA modeling, Random Forest regression, and Prophet for time series forecasting. These models are used to identify patterns, trends, and anomalies in the data, and to predict future school completion rates. Key findings reveal significant disparities in school completion rates across states, genders, and income levels. The analysis also identifies clusters of states with similar completion rates, suggesting potential regional factors influencing educational outcomes. Furthermore, time series forecasting models accurately predict future completion rates, highlighting the importance of ongoing monitoring and intervention strategies. The study concludes with recommendations for policymakers and educators to address the observed disparities and improve school completion rates in Malaysia. These recommendations include targeted interventions for specific states and demographic groups, investment in early childhood education, and addressing the impact of income inequality on educational opportunities. The findings of this study contribute to the understanding of the factors influencing school completion in Malaysia and provide valuable insights for policymakers and educators to develop effective strategies to improve educational outcomes. |
Date: | 2025–01 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2501.18868 |
By: | Kamlakshya, Tikhnadhi (Citizens Bank); Hota, Ashish |
Abstract: | This paper introduces a novel framework for multimodal document intelligence, designed to enhance fraud prevention across various sectors. The core innovation lies in the integration of advanced AI and ML techniques, including OCR, deep learning, and NLP, within a purpose-built computer device for multimodal data fusion, as detailed in the author's recently granted patent by www.gov.uk/ [Intellectual Property# 6419907]. This device facilitates the seamless integration of textual, visual, and metadata elements extracted from documents, enabling a holistic understanding of the document's veracity and intent. The escalating sophistication of fraudulent activities across industries necessitates advanced, adaptive security measures. This paper presents a novel framework for multimodal document intelligence, designed to enhance fraud prevention in sectors such as banking and finance, life science and healthcare, government, and the public sector. Grounded in a recently patented AI and ML-enabled computer device for multimodal data fusion, the framework leverages Optical Character Recognition (OCR), deep learning-based image analysis, and natural language processing (NLP). Furthermore, it integrates the capabilities of DeepSeek-R1, a high-performance Mixture-of-Experts (MoE) large language model (LLM), and autonomous AI Agents for advanced reasoning, contextual understanding, and decision-making. This integrated approach facilitates proactive fraud detection, improved risk assessment, and strengthened compliance adherence, while also achieving unprecedented cost-effectiveness in deployment and operation. The efficacy of the framework is demonstrated through illustrative use cases, highlighting its potential to mitigate financial losses and uphold data integrity. |
Keywords: | Salesforce, Salesforce Financial Cloud, RAG, Data Completeness, Finance, Sales, Campaign, Digital Engagement, Customer Data Platform (CDP), Data Cloud, DeepSeek-R1, Optical Character Recognition (OCR), deep learning-based image analysis, natural language processing (NLP) |
Date: | 2025–02–11 |
URL: | https://d.repec.org/n?u=RePEc:osf:osfxxx:g5hw7_v1 |
By: | Mulder, Joris; Hocuk, Seyit; Kilic, Talip; Zezza, Alberto; Kumar, Pradeep |
Abstract: | Understanding men’s and women’s time use is a key factor in addressing issues and formulating policies related to division of labor, domestic work, and related gender disparities. However, obtaining data on individuals’ time use can be difficult and costly in the context of household surveys. Leveraging unique survey data collected in rural Malawi, this study investigates the possibility of predicting men’s and women’s time allocation to an extensive set of activities, using sensor signal data captured by accelerometers. Using machine learning techniques, the study builds a supervised classification model that is trained on the accelerometer data and a random subset of the time use survey data to predict individuals’ time allocation to 12 broad activity groups. The model can correctly classify each performed activity in 76 percent of the cases. The analysis shows that with 40 percent of the training data, this method can achieve 90 percent of the maximum level of predictive accuracy reached in the analysis. The findings prove the feasibility of this methodology and offer insights for enhancing both survey and accelerometer data collection processes to build better models. Using the method can improve the quality of costly and difficult to obtain time use surveys with cheaper, yet accurate, modeled estimates, obtained by combining objective data from wearable devices with time use data collected on smaller samples. |
Date: | 2024–06–28 |
URL: | https://d.repec.org/n?u=RePEc:wbk:wbrwps:19735 |
By: | Angelo Mele |
Abstract: | Exponential random graph models (ERGMs) are very flexible for modeling network formation but pose difficult estimation challenges due to their intractable normalizing constant. Existing methods, such as MCMC-MLE, rely on sequential simulation at every optimization step. We propose a neural network approach that trains on a single, large set of parameter-simulation pairs to learn the mapping from parameters to average network statistics. Once trained, this map can be inverted, yielding a fast and parallelizable estimation method. The procedure also accommodates extra network statistics to mitigate model misspecification. Some simple illustrative examples show that the method performs well in practice. |
Date: | 2025–02 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2502.01810 |
By: | Walker, Viviane; Angst, Mario |
Abstract: | Empirical research in the social sciences is often interested in understanding actor stances: the positions that social actors take regarding normative statements in societal discourse. In automated text analysis applications, the classification task of stance detection remains challenging. Stance detection is especially difficult due to semantic challenges such as implicitness or missing context but also due to the general nature of the task. In this paper, we explore the potential of Large Language Models (LLMs) to enable stance detection in a generalized (non-domain, non-statement specific) form. Specifically, we test a variety of different general prompt chains for zero-shot stance classifications. Our evaluation data consists of textual data from a real-world empirical research project in the domain of sustainable urban transport. For 1710 German newspaper paragraphs, each containing an organizational entity, we annotated the stance of the entity toward one of five normative statements. A comparison of four publicly available LLMs shows that they can improve upon existing approaches and achieve adequate performance. However, results depend heavily on the prompt chain method and the LLM, and vary by statement. Our findings have implications for computational linguistics methodology and political discourse analysis, as they offer a deeper understanding of the strengths and weaknesses of LLMs in performing the complex semantic task of stance detection. We strongly emphasise the necessity of domain-specific evaluation data for evaluating LLMs and considering trade-offs between model complexity and performance. |
Date: | 2025–02–03 |
URL: | https://d.repec.org/n?u=RePEc:osf:socarx:5a3k8_v1 |
By: | RezaeeDaryakenari, Babak (Leiden University) |
Abstract: | While politicians often argue that economic sanctions can induce policy changes in targeted states by undermining elite and public support for the reigning government, the efficacy of these measures, particularly against non-democratic regimes, is debatable. We propose that, counterintuitively, economic sanctions can bolster rather than diminish support for the sanctioned government, even in non-democratic contexts. However, this support shift and its magnitude can differ across various political factions and depend on the nature of the sanctions. To empirically evaluate our theoretical expectations, we use supervised machine learning to scrutinize nearly 2 million tweets from over 1,000 Iranian influencers, assessing their responses to both comprehensive and targeted sanctions during Donald Trump’s presidency. Our analysis shows that comprehensive sanctions generally improved sentiments toward the Iranian government, even among its moderate oppositions, rendering them more aligned with the state's stance. Conversely, while targeted sanctions elicited a milder rally-around-the-flag response, the identity of the targeted entity plays a crucial role in determining the scale of this reaction. |
Date: | 2024–08–29 |
URL: | https://d.repec.org/n?u=RePEc:osf:socarx:r7ae4_v1 |
By: | Christian Fieberg; Lars Hornuf; Maximilian Meiler; David J. Streich |
Abstract: | We study whether large language models (LLMs) can generate suitable financial advice and which LLM features are associated with higher-quality advice. To this end, we elicit portfolio recommendations from 32 LLMs for 64 investor profiles, which differ in their risk preferences, home country, sustainability preferences, gender, and investment experience. Our results suggest that LLMs are generally capable of generating suitable financial advice that takes into account important investor characteristics when determining market and risk exposures. The historical performance of the recommended portfolios is on par with that of professionally managed benchmark portfolios. We also find that foundation models and larger models generate portfolios that are easier to implement and more sensitive to investor characteristics than fine-tuned models and smaller models. Some of our results are consistent with LLMs inheriting human biases such as home bias. We find no evidence of gender-based discrimination, which can be found in human financial advice. |
Keywords: | generative AI, artificial intelligence, large language models, financial advice, portfolio management |
JEL: | G00 G11 G40 |
Date: | 2025 |
URL: | https://d.repec.org/n?u=RePEc:ces:ceswps:_11666 |
By: | Pat Pataranutaporn; Nattavudh Powdthavee; Pattie Maes |
Abstract: | We investigate whether artificial intelligence can address the peer review crisis in economics by analyzing 27,090 evaluations of 9,030 unique submissions using a large language model (LLM). The experiment systematically varies author characteristics (e.g., affiliation, reputation, gender) and publication quality (e.g., top-tier, mid-tier, low-tier, AI generated papers). The results indicate that LLMs effectively distinguish paper quality but exhibit biases favoring prominent institutions, male authors, and renowned economists. Additionally, LLMs struggle to differentiate high-quality AI-generated papers from genuine top-tier submissions. While LLMs offer efficiency gains, their susceptibility to bias necessitates cautious integration and hybrid peer review models to balance equity and accuracy. |
Date: | 2025–01 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2502.00070 |
By: | Palinski, Michal; Asik, Günes; Gajderowicz, Tomasz; Jakubowski, Maciej; Efsan Nas Ozen; Dhushyanth Raju |
Abstract: | This study expands the inventory of green job titles by incorporating a global perspective and using contemporary sources. It leverages natural language processing, specifically a retrieval-augmented generation model, to identify green job titles. The process began with a search of academic literature published after 2008 using the official APIs of Scopus and Web of Science. The search yielded 1,067 articles, from which 695 unique potential green job titles were identified. The retrieval-augmented generation model used the advanced text analysis capabilities of Generative Pre-trained Transformer 4, providing a reproducible method to categorize jobs within various green economy sectors. The research clustered these job titles into 25 distinct sectors. This categorization aligns closely with established frameworks, such as the U.S. Department of Labor’s Occupational Information Network, and suggests potential new categories like green human resources. The findings demonstrate the efficacy of advanced natural language processing models in identifying emerging green job roles, contributing significantly to the ongoing discourse on the green economy transition. |
Date: | 2024–09–16 |
URL: | https://d.repec.org/n?u=RePEc:wbk:wbrwps:10908 |
By: | Ismael Yacoubou Djima; Marco Tiberti; Talip Kilic |
Abstract: | This paper addresses the challenge of missing crop yield data in large-scale agricultural surveys, where crop-cutting, the most accurate method for yield measurement, is often limited due to cost constraints. Multiple imputation techniques, supported by machine learning models, are used to predict missing yield data. This method is validated using survey data from Mali, which includes both crop-cut and self-reported yield information. The analysis covers several crops, providing insights into the importance of different predictors, including farmer-reported yields and geo-spatial variables, and the conditions under which the approach is valid. The findings show that machine learning-based imputations can provide accurate yield estimates, especially for crops with low intercropping rates and higher commercialization. However, survey-to-survey imputations are less accurate than within-survey imputations, suggesting limitations in extrapolating data across different survey rounds. The study contributes valuable insights into improving cost-efficiency in agricultural surveys and the potential of imputation methods. |
Date: | 2024–11–04 |
URL: | https://d.repec.org/n?u=RePEc:wbk:wbrwps:10964 |
By: | Senst, Benjamin |
Abstract: | For large organisations with numerous organisational units, it can be challenging to keep track of individual events. In a joint project by Data Science for Social Good Berlin e.V. and the Data Science Hub of the German Red Cross, social services were processed over several phases between summer 2022 and summer 2024 using new technologies such as web scraping, data engineering, and natural language processing, and their implementation in various user applications was tested. More than 600,000 web documents were collected and more than 30,000 offers were identified. The results of this automated method were compared with the existing data set. Web scraping and subsequent processing are suitable for at least supplementing the previous approach. Web scraping, NLP, and data engineering offer large organisations the opportunity to effectively gain an overview of local events. |
Date: | 2024–09–06 |
URL: | https://d.repec.org/n?u=RePEc:osf:socarx:3pd4s_v1 |
By: | Moisio, Pasi; Mesiäislehto, Merita; Peltoniemi, Johanna; Pihlajamäki, Mika; Hiilamo, Heikki |
Abstract: | Utilizing Large Language Models (LLMs), this study investigates the evolution of an innovative social security policy idea, the General Benefit concept, into a policy reform proposal in Finland from 2007 to 2023. Drawing from the ideational analysis, we hypothesize that political parties struggled over social security conditionality during the 2010s and that social security simplification was manipulated differently in relation to conditionality. Our primary data is election manifestos and governmental programs from 2007-2023. We employed LLMs, mainly a customized ChatGPT, for the text analysis of policy documents. Additionally, we conduct a critical human evaluation of the LLM analysis and publish our model in the GPT store for the open replication of analyses. Findings indicate that the weakening of the tripartite industrial relations system and the breaking of the “status quo of three big parties” allowed new parties to influence social policy in the 2010s. The General Benefit emerged as a response to calls for social security simplification and for countering (unconditional) basic income proposals. Adopted in 2023, the General Benefit concept aims to merge Finnish universal / residence-based social insurance benefits for the working-aged while preserving core principles like social risk categories and conditionality. Despite increased nativism from the rising True Finns party, and the adoption of universal / unconditional basic income by several parties, Finnish social policy trends from 2007 to 2023 continued to emphasize employment and public finance sustainability. Our study also contributes to methodological discussions on using LLMs in policy analysis. The “human evaluation”, performed by the authors, confirms that the LLM analysis accurately summarises the main features of the policy evolution. However, we also found that the LLM lacks the ability to recognise the nuances of “multidimensional” political language and is not very helpful in cross-sectional evaluation, which leaves the analysis partly shallow. Thus, we conclude that in qualitative policy analysis, LLMs in their current form are suitable for complementing rather than substituting human evaluation. |
Date: | 2024–10–10 |
URL: | https://d.repec.org/n?u=RePEc:osf:socarx:ab8mr_v1 |
By: | David Newhouse; Swindle, Rachel; Wang, Shun; Joshua David Merfeld; Utz Johann Pape; Kibrom Tafere; Michael Weber |
Abstract: | This paper investigates the extent to which real-time indicators derived from internet search, cell phones, and satellites predict changes in household socioeconomic indicators across approximately 300 administrative level-1 regions in 20 countries during the COVID-19 crisis. Measures of changes in socioeconomic status in each region are taken from high-frequency phone surveys. When using the first wave of data, fielded between April and August 2020, models selected using the least absolute shrinkage and selection operator explain 37 percent of the cross-regional variation in the share of households reporting declines in total income and 34 percent of the share of respondents reporting work stoppages since the onset of the crisis. Real-time indicators explain a lower amount of the within-region variation in income losses and current employment over time, with an R² of 15 percent for current employment and 22 to 26 percent for the prevalence of income declines. When limiting the sample to urban regions, real-time indicators are far more effective at explaining within-region variation in income losses and current employment, with R² values of approximately 0.54 and 0.38, respectively. Income gains, self-reported food insecurity, social distancing behavior, and child school engagement are more difficult to predict, with R² values ranging from 0.06 to 0.17. Google search terms related to food, money, jobs, and religion were the most powerful predictors of work stoppage and income declines in the first survey wave, while those related to food, exercise, and religion better tracked changes in income declines and employment over time. Google mobility measures are also strong predictors of changes in employment and the prevalence of specific types of income declines. In general, satellite data on vegetation, pollution, and nighttime lights are far less predictive. Google mobility and search data, and to a lesser extent vegetation and pollution data, can provide a meaningful signal of regional economic distress and recovery, particularly during the early phases of a major crisis such as COVID-19. |
Date: | 2024–09–18 |
URL: | https://d.repec.org/n?u=RePEc:wbk:wbrwps:10916 |
By: | Antonacci, Paulo; Muhammad Khudadad Chattha |
Abstract: | This paper presents an evaluation of a tax enforcement program conducted in Indonesia where officials from the tax authority visited properties to engage directly with owners about their property tax obligations. Through these visits, auditors explained outstanding debts and payment processes, aiming to improve tax compliance and revenue collection. The paper uses an administrative data set and a new set of machine learning–based techniques to assess the program’s effectiveness. The program was responsible for increasing tax compliance on the extensive margin by 4.3 percent and on the intensive margin by 5.1 percent in the first year it was implemented. These effects are particularly strong as they persist in the following period. The findings show that the visited properties had better compliance history, lower value, smaller area, and were more likely to have some construction on them. A key finding from the analysis is that higher-value properties are less sensitive to the visits. In other words, if a data-driven tax-enforcement strategy is to be applied, then it may focus resources on enforcing taxation at the poorest part of the population in this case. This opens up the discussion of the distributional consequences of an algorithm-based enforcement strategy, which is increasingly important as machine learning techniques are used by tax authorities. |
Date: | 2024–09–06 |
URL: | https://d.repec.org/n?u=RePEc:wbk:wbrwps:10901 |
By: | Kamer Ali Yuksel; Hassan Sawaf |
Abstract: | Financial metrics like the Sharpe ratio are pivotal in evaluating investment performance by balancing risk and return. However, traditional metrics often struggle with robustness and generalization, particularly in dynamic and volatile market conditions. This paper introduces AlphaSharpe, a novel framework leveraging large language models (LLMs) to iteratively evolve and optimize financial metrics. AlphaSharpe generates enhanced risk-return metrics that outperform traditional approaches in robustness and correlation with future performance metrics by employing iterative crossover, mutation, and evaluation. Key contributions of this work include: (1) an innovative use of LLMs for generating and refining financial metrics inspired by domain-specific knowledge, (2) a scoring mechanism to ensure the evolved metrics generalize effectively to unseen data, and (3) an empirical demonstration of 3x predictive power for future risk-return forecasting. Experimental results on a real-world dataset highlight the superiority of AlphaSharpe metrics, making them highly relevant for portfolio managers and financial decision-makers. This framework not only addresses the limitations of existing metrics but also showcases the potential of LLMs in advancing financial analytics, paving the way for informed and robust investment strategies. |
Date: | 2025–01 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2502.00029 |
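For readers unfamiliar with the baseline metric named in the abstract above, the following is a minimal sketch of the textbook Sharpe ratio (mean excess return divided by the sample standard deviation of excess returns, optionally annualized). It is not the AlphaSharpe method, whose evolved metrics are not specified in the abstract; the function name, the `risk_free` and `periods_per_year` parameters, and the choice of 252 trading days are illustrative assumptions.

```python
import math
import statistics


def sharpe_ratio(returns, risk_free=0.0, periods_per_year=252):
    """Textbook Sharpe ratio of a series of periodic returns.

    returns: per-period simple returns (e.g. daily), as floats.
    risk_free: per-period risk-free rate (assumed constant here).
    periods_per_year: annualization factor (252 is an assumed
        trading-day convention; pass 1 for no annualization).
    """
    excess = [r - risk_free for r in returns]
    mean_excess = statistics.fmean(excess)
    # Sample standard deviation (n - 1 denominator).
    volatility = statistics.stdev(excess)
    if volatility == 0:
        raise ValueError("returns have zero volatility")
    return mean_excess / volatility * math.sqrt(periods_per_year)
```

Calling `sharpe_ratio(daily_returns)` on a list of daily returns gives an annualized figure under these assumptions; other conventions (population standard deviation, time-varying risk-free rates) would change the result.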
By: | Juha, Sharmin Jahan; Mizan, Arefin |
Abstract: | Bangladesh started its COVID-19 mass vaccination program in February 2020. According to the COVAX live (Live COVID-19 Vaccination Tracker), as of May 2022, 70.26% of the total population had been vaccinated. This is indeed an example of the Bangladesh government's astounding competence, as Bangladesh is considered among the first few countries to start vaccinations. However, Bangladesh did experience occasional hiccups in steady vaccine roll-out due to disruptions in the supply chain. Experts condemned Bangladesh's diplomatic choice of relying on only India as a steady vaccine manufacturing source after India decided to temporarily halt vaccinations right before the administration of second doses. With reference to the 'Neighborhood Effect' in International Politics, which implies that excessive dependency on geographical neighbors can cause similar levels of instability in both countries (neighbors), this paper examines Bangladesh's overall diplomatic approach in its COVID-19 vaccination program in comparison with its East Asian counterpart Mongolia. Mongolia secured a high-ranking position in COVID-19 mass vaccination using its strategic partnerships to pool vaccines from multiple sources as a result of its 2011 multi-pillar Foreign Policy (Third Neighborhood Policy) approach. Applying a novel computational multimodal discourse analysis with machine learning-assisted techniques to two large hand-collected datasets, the paper delves into the practices implemented by Bangladesh's multi-level stakeholders from the early stages of the pandemic until January this year to find any signs or impacts of the Neighborhood Effect in its Vaccine Diplomacy. The paper then makes policy-level suggestions on how to resolve this in future health crises, with occasional mention of and comparison to Mongolia's Third Neighborhood approach and its applicability in the Bangladeshi context. |
Date: | 2025–01–10 |
URL: | https://d.repec.org/n?u=RePEc:osf:socarx:eg58k_v1 |
By: | Alvero, AJ; antonio, anthony lising; Luqueño, Leslie; Pearman, Francis |
Abstract: | Computational text analysis has grown in popularity among social scientists due to the massive influx of digitized data. However, connecting text to authorship could be a boon for digital demography and expand the scope of computational text analysis from trends of what is being written toward social patterning of the people producing it. We explore this potential through examinations of a large corpus of college admissions essays (n = 254,820 essays submitted by 83,538 applicants) and show how personal identity markers and ZIP code-level social context data influence large scale processes of textual production. After generating numerical representations of the essays using computational methods, we model the relationships between different identity and spatial characteristics of applicants and their local communities. We find strong relationships between identity and spatial features with the essays. We also find that individuals whose personal identities are spatially unique--that is, demographically different from others in their immediate context--were most likely to be misclassified, indicating that writing is influenced both socially and spatially. This work clarifies how authorship characteristics shape large scale textual production processes, like college admissions, and complements other large scale analyses of text by focusing on authorship rather than purely textual patterns. |
Date: | 2025–01–31 |
URL: | https://d.repec.org/n?u=RePEc:osf:socarx:pt6b2_v2 |
By: | Maître, Arnaud T.; Pugachyov, Nikolay; Weigert, Florian |
Abstract: | This paper investigates how investors' abnormal attention affects the cross-section of cryptocurrency returns in the period from 2018 to 2022. We capture abnormal attention using the (log) number of Twitter posts on individual cryptocurrencies on the current day minus a 30-day average. Our results reveal that abnormal attention is positively associated with contemporaneous and one-day ahead crypto performance. Among the different Twitter tweets, return predictability arises due to Ticker-tweets from investors, but not due to tweets from the cryptocurrency channel. These Official-tweets, however, are able to forecast technological innovations on the blockchain. |
Keywords: | Bitcoin, cryptocurrencies, Twitter attention, textual sentiment |
Date: | 2025 |
URL: | https://d.repec.org/n?u=RePEc:zbw:cfrwps:311833 |
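The abnormal-attention measure described in the abstract above (the log number of posts on the current day minus a 30-day average) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the abstract does not say how zero-post days are handled or whether the average is taken over log counts or raw counts, so this sketch uses log(1 + count) and averages the log counts of the 30 preceding days. The function name and `window` parameter are hypothetical.

```python
import math


def abnormal_attention(daily_counts, window=30):
    """Abnormal attention for the most recent day in daily_counts.

    daily_counts: post counts per day, oldest first; must hold at
        least window + 1 entries (the baseline days plus today).
    Assumptions: log(1 + count) keeps zero-post days defined, and the
    baseline is the mean of log counts over the preceding `window`
    days, excluding today.
    """
    if len(daily_counts) < window + 1:
        raise ValueError("need at least window + 1 daily counts")
    logs = [math.log1p(count) for count in daily_counts]
    today = logs[-1]
    # Slice covers the `window` days immediately before today.
    baseline = sum(logs[-window - 1:-1]) / window
    return today - baseline
```

A positive value means the coin drew more posts today than its recent norm; a flat series gives a value near zero.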
By: | Ivo Teruggi; Oscar Eduardo Barriga Cabanillas; Walker Kosmidou-Bradley; Silvia Redaelli; Eigo Tateishi |
Abstract: | This study uses nighttime lights to examine the evolution of economic activity in Afghanistan after the August 2021 regime change. A year later, nighttime luminosity had dropped by 20 percent, with two-thirds of this decline tied to the pre-planned international military withdrawal. To focus on local economic activity, the study filters out light emissions from foreign military installations, which accounted for up to 30 percent of lights over the past decade. Using civilian nighttime lights to understand the new economic reality in the country indicates a significant economic recovery concentrated in previously conflict-affected regions. By 2023/24, civilian luminosity had surpassed pre-2020/21 levels by 10.5 percent while, in contrast, official gross domestic product indicates an economy that is one-quarter smaller. The findings highlight changes in economic dynamics, including increased informality, shifts in the geographic distribution of activity, and improved security post-Taliban takeover. |
Date: | 2024–11–06 |
URL: | https://d.repec.org/n?u=RePEc:wbk:wbrwps:10969 |