NEP: New Economics Papers |
on Big Data |
By: | Magaletti, Nicola; Notarnicola, Valeria; Di Molfetta, Mauro; Mariani, Stefano; Leogrande, Angelo |
Abstract: | The paper investigates the deployment of data analytics and machine learning to improve welding quality at Tecnomulipast srl, a small-to-medium-sized manufacturing firm located in Puglia, Italy. The firm produces food-machine components and recently mechanized its laser welding process by introducing an IoT-enabled system with integrated photographic control. The investment, underwritten by the Apulia Region under PIA (Programmi Integrati di Agevolazione), allowed Tecnomulipast not only to mechanize its production line but also to embark upon a wider digital transformation, including the creation of internal data analytics infrastructure capable of underpinning machine learning and artificial intelligence applications. This paper addresses the prediction of weld bead width (LC) with a dataset of 1,000 observations. Input variables are laser power (PL), pulse time (DI), frequency (FI), beam diameter (DF), focal position (PF), travel speed (VE), trajectory accuracy (TR), laser angle (AN), gas flow (FG), gas purity (PG), ambient temperature (TE), and penetration depth (PE). These parameters were used to build and validate supervised machine learning algorithms including Decision Trees, Random Forest, K-Nearest Neighbors, Support Vector Machines, Neural Networks, and Linear Regression. Model performance was measured by MSE, RMSE, MAE, MAPE, and R². Ensemble methods such as Random Forest and Boosting performed best. Feature importance analysis determined that laser power, gas flow, and trajectory accuracy are the key variables. This project shows how Tecnomulipast has leveraged public investment to pursue digital transformation and adopt data-driven strategies within Industry 4.0. |
Keywords: | Tecnomulipast, laser welding, machine learning, digital transformation, Industry 4.0. |
JEL: | C45 C5 C53 L23 O33 |
Date: | 2025–04–24 |
URL: | https://d.repec.org/n?u=RePEc:pra:mprapa:124548 |
By: | Jacob Carlson; Melissa Dell |
Abstract: | This paper presents a general framework for conducting efficient and robust inference on parameters derived from unstructured data, which include text, images, audio, and video. Economists have long incorporated data extracted from texts and images into their analyses, a practice that has accelerated with advancements in deep neural networks. However, neural networks do not generically produce unbiased predictions, potentially propagating bias to estimators that use their outputs. To address this challenge, we reframe inference with unstructured data as a missing structured data problem, where structured data are imputed from unstructured inputs using deep neural networks. This perspective allows us to apply classic results from semiparametric inference, yielding valid, efficient, and robust estimators based on unstructured data. We formalize this approach with MARS (Missing At Random Structured Data), a unifying framework that integrates and extends existing methods for debiased inference using machine learning predictions, linking them to a variety of older, familiar problems such as causal inference. We develop robust and efficient estimators for both descriptive and causal estimands and address challenges such as inference using aggregated and transformed predictions from unstructured data. Importantly, MARS applies to common empirical settings that have received limited attention in the existing literature. Finally, we reanalyze prominent studies that use unstructured data, demonstrating the practical value of MARS. |
Date: | 2025–05 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2505.00282 |
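The imputation-plus-correction logic behind this kind of debiased inference can be illustrated with a toy sketch (all numbers are hypothetical; the paper's MARS estimators are far more general, covering causal estimands and transformed predictions):

```python
import random

random.seed(0)

# Population of items; the "structured" variable y is observed only on a small
# labeled subsample, and elsewhere imputed by a systematically biased predictor.
N, n_labeled = 20000, 1000
y = [random.gauss(1.0, 1.0) for _ in range(N)]
pred = [yi + 0.3 + random.gauss(0, 0.5) for yi in y]   # predictor with bias +0.3

labeled = range(n_labeled)   # assumed missing at random (the "MAR" in MARS)

# Naive plug-in estimate vs. the bias-corrected estimate.
naive = sum(pred) / N
correction = sum(y[i] - pred[i] for i in labeled) / n_labeled
debiased = naive + correction

true_mean = sum(y) / N
print(round(naive - true_mean, 2), round(debiased - true_mean, 2))
```

The naive average of predictions inherits the predictor's bias, while the labeled subsample estimates and removes it.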
By: | Giovanni Ballarin; Jacopo Capra; Petros Dellaportas |
Abstract: | Stock return prediction is a problem that has received much attention in the finance literature. In recent years, sophisticated machine learning methods have been shown to perform significantly better than "classical" prediction techniques. One downside of these approaches is that they are often very expensive to implement, for both training and inference, because of their high complexity. We propose a return prediction framework for intraday returns at multiple horizons based on Echo State Network (ESN) models, wherein a large portion of the parameters are drawn at random and never trained. We show that this approach enjoys the benefits of recurrent neural network expressivity, an inherently efficient implementation, and strong forecasting performance. |
Date: | 2025–04 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2504.19623 |
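A minimal sketch of the Echo State Network idea — recurrent weights drawn at random and left untrained, with only a linear readout fit by ridge regression — on a synthetic AR(1) series (series, sizes, and hyperparameters are illustrative, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy return series: an AR(1) signal plus noise (stand-in for intraday returns).
T = 500
r = np.zeros(T)
for t in range(1, T):
    r[t] = 0.6 * r[t - 1] + 0.1 * rng.standard_normal()

# Random reservoir: these weights are drawn once and never trained.
N = 100
W_in = rng.uniform(-0.5, 0.5, size=N)        # input weights
W = rng.standard_normal((N, N))
W *= 0.9 / max(abs(np.linalg.eigvals(W)))    # spectral radius below 1 ("echo")

# Drive the reservoir with the series and collect its states.
X = np.zeros((T, N))
x = np.zeros(N)
for t in range(T):
    x = np.tanh(W @ x + W_in * r[t])
    X[t] = x

# Only the linear readout is trained: ridge regression, one-step-ahead target.
lam = 1e-2
A, y = X[:-1], r[1:]
w_out = np.linalg.solve(A.T @ A + lam * np.eye(N), A.T @ y)

preds = A @ w_out
mse_esn = float(np.mean((preds - y) ** 2))
mse_naive = float(np.mean((r[:-1] - y) ** 2))   # "last value" baseline
print(mse_esn, mse_naive)
```

The expensive recurrent part is fixed at initialization, so "training" reduces to one linear solve — the efficiency argument the abstract makes.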
By: | Nicolas de Roux; Laurent Ferrara |
Abstract: | Officially, the U.S. Federal Reserve has a statutory dual domestic mandate of price stability and full employment, but in this paper we question the role of the international environment in shaping Fed monetary policy decisions. To this end, we use minutes of the Federal Open Market Committee (FOMC) and construct indexes of the attention paid by U.S. monetary policymakers to the international economic and financial situation. These indexes are built by applying natural language processing (NLP) techniques ranging from word counts and built-from-scratch machine learning models to OpenAI's GPT models. By integrating these text-based indicators into a Taylor rule, we derive various quantitative measures of the external influences on Fed decisions. Our results show that when there is a focus on international topics within the FOMC, the Fed's monetary policy generally tends to be more accommodative than a standard Taylor rule would predict. This result is robust to various alternatives that include a time-varying neutral interest rate or a shadow central bank interest rate. |
Keywords: | Monetary policy, Federal Reserve, FOMC minutes, International environment, Natural Language Processing, Machine Learning |
JEL: | E52 F42 C54 |
Date: | 2025 |
URL: | https://d.repec.org/n?u=RePEc:drm:wpaper:2025-23 |
By: | Marco Zanotti |
Abstract: | In an era of increasing computational capabilities and growing environmental consciousness, organizations face a critical challenge in balancing the accuracy of their forecasting models with computational efficiency and sustainability. Global forecasting models, which leverage data across multiple time series to improve prediction accuracy while lowering computational time, have gained significant attention over the years. However, the common practice of retraining these models with new observations raises important questions about the costs of producing forecasts. Using ten different machine learning and deep learning models, we analyze various retraining scenarios, ranging from continuous updates to no retraining at all, across two large retail datasets. We show that less frequent retraining strategies can maintain forecast accuracy while reducing computational costs, providing a more sustainable approach to large-scale forecasting. We also find that machine learning models are a marginally better choice for reducing the costs of forecasting when coupled with less frequent retraining strategies as the frequency of the data increases. Our findings challenge the conventional belief that frequent retraining is essential for maintaining forecasting accuracy. Instead, periodic retraining offers a good balance between predictive performance and efficiency, for both point and probabilistic forecasting. These insights provide actionable guidelines for organizations seeking to optimize forecasting pipelines while reducing costs and energy consumption. |
Keywords: | Time series, Demand forecasting, Forecasting competitions, Cross-learning, Global models, Machine learning, Deep learning, Green AI, Conformal predictions. |
JEL: | C53 C52 C55 |
Date: | 2025–04 |
URL: | https://d.repec.org/n?u=RePEc:mib:wpaper:551 |
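The trade-off the paper quantifies can be illustrated with a toy rolling-origin experiment in which "training" is just re-estimating a level forecaster (everything here is a stand-in for the paper's ten ML/DL models and retail datasets):

```python
import random

random.seed(1)
# Toy demand series: a stable level plus noise.
series = [100 + random.gauss(0, 5) for _ in range(200)]

def fit_mean(history):
    """'Training' is just estimating the level; a stand-in for a real model fit."""
    return sum(history) / len(history)

def rolling_forecast(series, start, retrain_every):
    model, fits, abs_errors = None, 0, []
    for t in range(start, len(series)):
        if model is None or (t - start) % retrain_every == 0:
            model = fit_mean(series[:t])   # refit on all data seen so far
            fits += 1
        abs_errors.append(abs(series[t] - model))
    return sum(abs_errors) / len(abs_errors), fits

mae_cont, fits_cont = rolling_forecast(series, 100, retrain_every=1)   # continuous
mae_peri, fits_peri = rolling_forecast(series, 100, retrain_every=20)  # periodic
print(fits_cont, fits_peri)
```

With a stable data-generating process, the periodic strategy pays for 5 fits instead of 100 while delivering nearly the same accuracy — the paper's point about sustainability.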
By: | Giovanni Compiani; Ilya Morozov; Stephan Seiler |
Abstract: | We propose a demand estimation method that leverages unstructured text and image data to infer substitution patterns. Using pre-trained deep learning models, we extract embeddings from product images and textual descriptions and incorporate them into a random coefficients logit model. This approach enables researchers to estimate demand even when they lack data on product attributes or when consumers value hard-to-quantify attributes, such as visual design or functional benefits. Using data from a choice experiment, we show that our approach outperforms standard attribute-based models in counterfactual predictions of consumers' second choices. We also apply it across 40 product categories on Amazon and consistently find that text and image data help identify close substitutes within each category. |
Date: | 2025–03 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2503.20711 |
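A stripped-down sketch of feeding embeddings into a logit demand model: simulated consumers choose among three products described only by embedding vectors, and taste weights on the embedding dimensions are recovered by maximum likelihood (random vectors stand in for the image/text embeddings; the paper's random-coefficients model is much richer):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 3 products, each described only by a 4-dim "embedding";
# no conventional attribute data at all.
d, n_consumers = 4, 2000
E = rng.standard_normal((3, d))              # product embeddings
beta_true = np.array([1.0, -0.5, 0.8, 0.3])  # true tastes over embedding dims

# Simulate logit choices: P(j) proportional to exp(E_j . beta_true).
u = E @ beta_true
p_true = np.exp(u) / np.exp(u).sum()
choices = rng.choice(3, size=n_consumers, p=p_true)

# Estimate tastes by gradient ascent on the logit log-likelihood.
beta = np.zeros(d)
for _ in range(30000):
    s = np.exp(E @ beta)
    probs = s / s.sum()
    beta += 0.05 * (E[choices].mean(axis=0) - probs @ E)   # d(LL/n)/d(beta)

fitted = np.exp(E @ beta); fitted /= fitted.sum()
emp = np.bincount(choices, minlength=3) / n_consumers
print(np.round(fitted, 3), np.round(emp, 3))   # fitted shares match observed shares
```

The fitted model reproduces the observed choice shares using only embedding coordinates as "attributes" — the substitution-pattern logic then comes from proximity in embedding space.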
By: | Ziqi Li |
Abstract: | This chapter discusses the opportunities of eXplainable Artificial Intelligence (XAI) within the realm of spatial analysis. A key objective in spatial analysis is to model spatial relationships and infer spatial processes to generate knowledge from spatial data, which has largely been based on spatial statistical methods. More recently, machine learning offers scalable and flexible approaches that complement traditional methods and has been increasingly applied in spatial data science. Despite its advantages, machine learning is often criticized for being a black box, which limits our understanding of model behavior and output. Recognizing this limitation, XAI has emerged as a pivotal field in AI that provides methods to explain the output of machine learning models, enhancing transparency and understanding. These methods are crucial for model diagnosis, bias detection, and ensuring the reliability of results obtained from machine learning models. This chapter introduces key concepts and methods in XAI with a focus on Shapley value-based approaches, arguably the most popular family of XAI methods, and their integration with spatial analysis. An empirical example of county-level voting behavior in the 2020 Presidential election demonstrates the use of Shapley values and spatial analysis, with a comparison to multi-scale geographically weighted regression. The chapter concludes with a discussion of the challenges and limitations of current XAI techniques and proposes new directions. |
Date: | 2025–05 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2505.00591 |
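The Shapley values the chapter focuses on can be computed exactly for a tiny model by enumerating coalitions (a fixed baseline stands in for the background-data averaging that SHAP libraries perform):

```python
from itertools import combinations
from math import factorial

# Toy "model": a fixed nonlinear function of three features
# (a stand-in for a trained black-box model such as a random forest).
def model(z):
    return 2.0 * z[0] + z[1] * z[2]

x = [1.0, 2.0, 3.0]          # instance to explain
baseline = [0.0, 0.0, 0.0]   # values used to "switch off" features

def value(coalition):
    """Model output with only the coalition's features at their actual values."""
    z = [x[i] if i in coalition else baseline[i] for i in range(len(x))]
    return model(z)

def shapley(i, n=3):
    others = [j for j in range(n) if j != i]
    phi = 0.0
    for k in range(n):
        for S in combinations(others, k):
            w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
            phi += w * (value(set(S) | {i}) - value(set(S)))
    return phi

phis = [shapley(i) for i in range(3)]
print(phis)                                   # feature 0 gets 2; 1 and 2 split the 6
print(sum(phis), model(x) - model(baseline))  # efficiency: both equal 8
```

Note how the interaction term x1*x2 = 6 is split evenly between features 1 and 2, and the attributions sum exactly to the model output minus the baseline — the two properties that make Shapley values attractive for spatial feature attribution.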
By: | Tiantian Tu |
Abstract: | Time series forecasting is crucial for decision-making across various domains, particularly in financial markets where stock prices exhibit complex and non-linear behaviors. Accurately predicting future price movements is challenging due to the difficulty of capturing both short-term fluctuations and long-term dependencies in the data. Convolutional Neural Networks (CNNs) are well-suited for modeling localized, short-term patterns but struggle with long-range dependencies due to their limited receptive field. In contrast, Transformers are highly effective at capturing global temporal relationships and modeling long-term trends. In this paper, we propose a hybrid architecture that combines CNNs and Transformers to effectively model both short- and long-term dependencies in financial time series data. We apply this approach to forecast stock price movements for S&P 500 constituents and demonstrate that our model outperforms traditional statistical models and popular deep learning methods in intraday stock price forecasting, providing a robust framework for financial prediction. |
Date: | 2025–04 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2504.19309 |
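The division of labor the paper describes — convolutions for local patterns, attention for global dependencies — can be sketched in a few lines of NumPy (untrained random weights; the actual architecture, dimensions, and training procedure are the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d(x, kernels):
    """Local (short-term) feature extraction: valid 1-D convolution per kernel."""
    k = kernels.shape[1]
    windows = np.stack([x[i:i + k] for i in range(len(x) - k + 1)])  # (T-k+1, k)
    return windows @ kernels.T                                       # (T-k+1, d)

def self_attention(H, Wq, Wk, Wv):
    """Global (long-term) mixing: single-head scaled dot-product attention."""
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    scores = Q @ K.T / np.sqrt(H.shape[1])
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)        # row-wise softmax over all positions
    return w @ V

T, d = 64, 4
prices = np.cumsum(rng.standard_normal(T))   # toy price path
kernels = rng.standard_normal((d, 5))        # four filters of width 5 (untrained)
Wq, Wk, Wv = [0.1 * rng.standard_normal((d, d)) for _ in range(3)]

H = conv1d(prices, kernels)                  # (60, 4): short-term patterns
Z = self_attention(H, Wq, Wk, Wv)            # (60, 4): each step attends to all others
print(H.shape, Z.shape)
```

Each convolution output only sees a 5-step window, while each attention output is a weighted mix of all 60 positions — exactly the receptive-field contrast the abstract motivates.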
By: | Alina Landowska; Robert A. K{\l}opotek; Dariusz Filip; Konrad Raczkowski |
Abstract: | This study examines the relationship between GDP growth and Gross Fixed Capital Formation (GFCF) across developed economies (G7, EU-15, OECD) and emerging markets (BRICS). We integrate Random Forest machine learning (non-linear regression) with traditional econometric models (linear regression) to better capture non-linear interactions in investment analysis. Our findings reveal that while GDP growth positively influences corporate investment, its impact varies significantly by region. Developed economies show stronger GDP-GFCF linkages due to stable financial systems, while emerging markets demonstrate weaker connections due to economic heterogeneity and structural constraints. Random Forest models indicate that GDP growth's importance is lower than suggested by traditional econometrics, with lagged GFCF emerging as the dominant predictor, confirming that investment follows path-dependent patterns rather than short-term GDP fluctuations. Regional variations in investment drivers are substantial: taxation significantly influences developed economies but minimally affects BRICS, while unemployment strongly drives investment in BRICS but less so elsewhere. We introduce a parallelized p-value importance algorithm for Random Forest that enhances computational efficiency while maintaining statistical rigor through sequential testing methods (SPRT and SAPT). The research demonstrates that hybrid methodologies combining machine learning with econometric techniques provide a more nuanced understanding of investment dynamics, supporting region-specific policy design and improving forecasting accuracy. |
Date: | 2025–04 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2504.20993 |
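A plain Monte Carlo version of permutation-based feature importance with p-values (the paper's parallelized algorithm instead uses sequential tests, SPRT and SAPT, to stop early; a linear fit stands in here for the Random Forest):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300
X = rng.standard_normal((n, 2))
y = 2.0 * X[:, 0] + 0.5 * rng.standard_normal(n)  # feature 0 matters, feature 1 is noise

def fit_predict(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # stand-in for the forest
    return X @ beta

def importance_pvalue(X, y, j, n_perm=200):
    """p-value for 'permuting feature j does not hurt the fit'.
    Small p => the feature genuinely carries predictive information."""
    base = np.mean((y - fit_predict(X, y)) ** 2)
    no_harm = 0
    for _ in range(n_perm):
        Xp = X.copy()
        Xp[:, j] = rng.permutation(Xp[:, j])
        if np.mean((y - fit_predict(Xp, y)) ** 2) <= base:
            no_harm += 1
    return (no_harm + 1) / (n_perm + 1)

p0, p1 = importance_pvalue(X, y, 0), importance_pvalue(X, y, 1)
print(p0, p1)   # p0 small (signal feature), p1 large (noise feature)
```

Sequential variants decide "important / not important" after far fewer than 200 permutations for clear-cut features, which is the computational point of the paper's algorithm.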
By: | Amelie BARBIER-GAUCHARD; Emmanouil SOFIANOS |
Abstract: | The situation of public finance in the eurozone remains a burning issue for certain Euro area countries. The financial markets, the main lenders of the Member States, are more attentive than ever to any factor that could affect the trajectory of public debt in the long term. The risk of bankruptcy of a Member State, and of a domino effect for the entire monetary union, represents the ultimate risk weighing on the Eurozone. This paper aims to forecast public debt at the national level within the Euro area with a single universal model. We use a dataset that includes 566 independent variables (economic, financial, institutional, political, and social) for 17 Euro area countries, spanning the period from 2000 to 2022 at annual frequency. The dataset is fed to four machine learning (ML) algorithms: Decision Trees, Random Forests, XGBoost, and Support Vector Machines (SVM). We also employ the Elastic-Net Regression algorithm from econometrics. The best model is an XGBoost with an out-of-sample MAPE of 8.41%; moreover, it outperforms the projections of the European Commission and the IMF. According to the variable importance measure (VIM) from XGBoost, the most influential variables are past values of public debt, the male population aged 50-54, regulatory quality, control of corruption, the female employment-to-population ratio for ages 15 and over, and the 10-year bond spread. |
Keywords: | Public Debt; Euro Area; Machine Learning; Forecasting. |
JEL: | C53 H63 H68 |
Date: | 2024 |
URL: | https://d.repec.org/n?u=RePEc:ulp:sbbeta:2024-47 |
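The out-of-sample MAPE used to rank the models is simple to state (the numbers below are hypothetical debt-to-GDP forecasts, not the paper's):

```python
def mape(actual, forecast):
    """Mean Absolute Percentage Error, in percent (undefined when an actual is 0)."""
    return 100.0 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)

# Hypothetical debt-to-GDP paths from two competing models:
actual  = [95.0, 100.0, 110.0, 120.0]
model_a = [90.0, 104.0, 112.0, 114.0]
model_b = [80.0,  90.0, 130.0, 140.0]
print(mape(actual, model_a), mape(actual, model_b))  # lower is better
```

Because errors are scaled by the actual value, MAPE lets debt forecasts for small and large economies be compared on one ladder, which is what makes a single "universal model" score meaningful across 17 countries.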
By: | Castro-Iragorri, Carlos (Universidad del Rosario); Parra-Diaz, Manuel (Universidad del Rosario) |
Abstract: | Recent advances in deep learning have spurred the development of end-to-end frameworks for portfolio optimization that utilize implicit layers. However, many such implementations are highly sensitive to neural network initialization, undermining performance consistency. This research introduces a robust end-to-end framework tailored for risk budgeting portfolios that effectively reduces sensitivity to initialization. Importantly, this enhanced stability does not compromise portfolio performance, as our framework consistently outperforms the risk parity benchmark. |
Keywords: | end-to-end framework; neural networks; risk budgeting; stability |
JEL: | C13 C45 G11 |
Date: | 2025–03–05 |
URL: | https://d.repec.org/n?u=RePEc:col:000092:021367 |
By: | Manuel Parra-Diaz; Carlos Castro-Iragorri |
Abstract: | Recent advances in deep learning have spurred the development of end-to-end frameworks for portfolio optimization that utilize implicit layers. However, many such implementations are highly sensitive to neural network initialization, undermining performance consistency. This research introduces a robust end-to-end framework tailored for risk budgeting portfolios that effectively reduces sensitivity to initialization. Importantly, this enhanced stability does not compromise portfolio performance, as our framework consistently outperforms the risk parity benchmark. |
Date: | 2025–04 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2504.19980 |
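For reference, the classical risk budgeting target that such end-to-end frameworks learn can be computed directly by minimizing the standard convex objective 0.5·w'Σw − Σ b_i·log(w_i) (the covariance matrix and budgets here are invented; the paper's contribution is the stable end-to-end learning, not this solver):

```python
import numpy as np

def risk_budget_weights(cov, budgets, lr=0.1, n_iter=20000):
    """Gradient descent on 0.5*w'Cov w - sum(b_i * log w_i); at the minimum,
    each asset's risk contribution w_i*(Cov w)_i is proportional to its budget."""
    w = np.ones(len(budgets))
    for _ in range(n_iter):
        grad = cov @ w - budgets / w
        w = np.maximum(w - lr * grad, 1e-8)   # stay in the positive orthant
    return w / w.sum()

cov = np.array([[0.04, 0.01, 0.00],
                [0.01, 0.09, 0.02],
                [0.00, 0.02, 0.16]])
b = np.array([1/3, 1/3, 1/3])                 # equal budgets = risk parity
w = risk_budget_weights(cov, b)
rc = w * (cov @ w)                            # risk contributions
print(np.round(w, 3), np.round(rc / rc.sum(), 3))
```

With equal budgets the low-volatility asset gets the largest weight, and each asset ends up contributing exactly one third of total portfolio risk — the risk parity benchmark the paper's framework is evaluated against.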
By: | Prashant Garg; Thiemo Fetzer |
Abstract: | Using basic health statements authorized by UK and EU registers, together with 9,100 journalist-vetted public-health assertions on topics such as abortion, COVID-19, and politics drawn from sources ranging from peer-reviewed journals and government advisories to social media and news outlets across the political spectrum, we benchmark six leading large language models in 21 languages. Despite high accuracy on English-centric textbook claims, performance falls in multiple non-European languages and fluctuates by topic and source, highlighting the urgency of comprehensive multilingual, domain-aware validation before deploying AI in global health communication. |
Date: | 2025–04 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2504.18310 |
By: | Hirmer, Christine (Institute for Employment Research (IAB), Nuremberg, Germany); Metzger, Lina-Jeanette (Institute for Employment Research (IAB), Nuremberg, Germany) |
Abstract: | "The constantly growing amount of digitally accessible text data and advances in natural language processing (NLP) have made text mining a key technology. The "Temi-Box" is a modular construction kit designed for text mining applications. It enables automated text classification, topic categorization, and clustering without necessitating extensive programming expertise. Developed on the basis of the keywording and topic assignment of publications for the IAB Info Platform and financed by EU funds, it is available as an open-source project. This research report documents the development and application of the Temi-Box and illustrates its use and the interpretation of the results obtained. Text mining extracts knowledge from unstructured texts using methods such as classification and clustering. The modular Temi-Box provides users with established methods in a user-friendly way and supports them with a pipeline architecture that simplifies standardised processes such as data preparation and model training. It incorporates both current and traditional approaches to text representation, such as BERT and TF-IDF, and offers a variety of algorithms for text classification and clustering, including K-Nearest Neighbors (KNN), binary and multinomial classifiers as layers in neural networks, and K-Means. Various evaluation metrics facilitate the assessment of model performance and the comparison of different approaches. Experiments on automated topic assignment and the identification of key topics illustrate the use of the Temi-Box and the interpretation of the results. Based on a dataset with 1,932 IAB publications and 105 topics, the results show that BERT-based models, such as GermanBERT, consistently achieve the best results. Binary classifiers prove to be particularly flexible and accurate, while TF-IDF-based approaches offer robust alternatives with less complexity. Clustering remains a challenge, especially when content overlaps. 
The Temi-Box is a highly versatile instrument. In addition to the application for the IAB Info Platform described in this research report, it can be used in numerous areas, such as the analysis of job advertisements, job and company descriptions, keywording of publications or for sentiment analysis. It can also be extended for use in question-and-answer systems or for named entity recognition. The Temi-Box facilitates the application of text mining methods for a broad user base and offers numerous customization options. It reduces the effort involved in developing and comparing models. Its open source availability promotes the further development and integration of the Temi-Box into various research projects. This enables users to adapt the platform to specific needs and integrate new functions. The report shows the potential of the Temi-Box to advance the digitization and automation of text data analysis. At the same time, challenges such as ensuring data quality and the interpretability of the models remain. These aspects require continuous validation and further development in order to further improve the effectiveness and reliability of text mining methods." (Author's abstract, IAB-Doku) ((en)) |
Keywords: | Federal Republic of Germany ; IAB open-access publication ; Natural Language Processing ; automation ; data analysis ; IAB ; indexing ; classification ; application ; text mining ; publication |
Date: | 2025–05–08 |
URL: | https://d.repec.org/n?u=RePEc:iab:iabfob:202513 |
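The classification step in a pipeline like the Temi-Box can be illustrated with a deliberately tiny nearest-neighbour sketch over bag-of-words counts (the actual toolkit works over BERT or TF-IDF representations; the corpus and labels here are invented):

```python
from collections import Counter

# Tiny labeled corpus (invented; stand-in for keyworded IAB publications).
train = [
    ("unemployment rose in the labor market survey", "labor"),
    ("job vacancies and employment trends", "labor"),
    ("neural networks for text classification", "methods"),
    ("clustering algorithms and evaluation metrics", "methods"),
]

def bow(text):
    return Counter(text.lower().split())

def nn_predict(text):
    """1-nearest-neighbour on bag-of-words overlap: the simplest stand-in for
    the KNN classifier the toolkit offers over richer representations."""
    q = bow(text)
    sims = [(sum((q & bow(doc)).values()), label) for doc, label in train]
    return max(sims)[1]

print(nn_predict("employment and unemployment statistics"))
```

Swapping `bow` for TF-IDF or BERT embeddings, and overlap for cosine similarity, turns this sketch into the pipeline pattern the report describes.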
By: | Joao Felipe Gueiros; Hemanth Chandravamsi; Steven H. Frankel |
Abstract: | This paper explores the use of deep residual networks for pricing European options on Petrobras, one of the world's largest oil and gas producers, and compares their performance with the Black-Scholes (BS) model. Using eight years of historical data from B3 (Brazilian Stock Exchange) collected via web scraping, a deep learning model was trained using a custom-built hybrid loss function that incorporates market data and analytical pricing. The data for training and testing were drawn from the period spanning November 2016 to January 2025, using an 80-20 train-test split; the test set consisted of the final three months: November, December, and January 2025. The deep residual network achieved a 64.3% reduction in mean absolute error for the 3-19 BRL (Brazilian Real) range compared to the Black-Scholes model on the test set. Furthermore, unlike the Black-Scholes solution, whose accuracy tends to degrade at longer times to expiration, the deep learning model remained accurate for longer expiration periods. These findings highlight the potential of deep learning in financial modeling, with future work focusing on specialized models for different price ranges. |
Date: | 2025–04 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2504.20088 |
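The Black-Scholes benchmark the network is compared against is a closed-form expression; here is a self-contained version with illustrative inputs (spot, strike, rate, and volatility are hypothetical, not Petrobras market data):

```python
from math import log, sqrt, exp, erf

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def bs_call(S, K, T, r, sigma):
    """Black-Scholes price of a European call (the paper's benchmark model)."""
    d1 = (log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    return S * norm_cdf(d1) - K * exp(-r * T) * norm_cdf(d2)

# Hypothetical inputs: spot 30 BRL, strike 32, 3 months, 10% rate, 40% vol.
price = bs_call(S=30.0, K=32.0, T=0.25, r=0.10, sigma=0.40)
print(round(price, 2))
```

A neural pricer is trained to beat this formula where its assumptions (constant volatility in particular) fail; the hybrid loss in the paper mixes market prices with this analytical anchor.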
By: | Kyungsu Kim |
Abstract: | This thesis studies the effectiveness of the Long Short-Term Memory (LSTM) model in forecasting future Job Openings and Labor Turnover Survey (JOLTS) data in the United States. Drawing on multiple economic indicators from various sources, the data are fed directly into the LSTM model to predict JOLTS job openings in subsequent periods. The performance of the LSTM model is compared with conventional autoregressive approaches, including ARIMA, SARIMA, and Holt-Winters. Findings suggest that the LSTM model outperforms these traditional models in predicting JOLTS job openings, as it not only captures the dependent variable's trends but also aligns with key economic factors. These results highlight the potential of deep learning techniques for capturing complex temporal dependencies in economic data, offering valuable insights for policymakers and stakeholders developing data-driven labor market strategies. |
Date: | 2025–03 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2503.19048 |
By: | Venkat Ram Reddy Ganuthula; Krishna Kumar Balaraman |
Abstract: | As artificial intelligence becomes increasingly integrated into professional and personal domains, traditional metrics of human intelligence require reconceptualization. This paper introduces the Artificial Intelligence Quotient (AIQ), a novel measurement framework designed to assess an individual's capacity to effectively collaborate with and leverage AI systems, particularly Large Language Models (LLMs). Building upon established cognitive assessment methodologies and contemporary AI interaction research, we present a comprehensive framework for quantifying human-AI collaborative intelligence. This work addresses the growing need for standardized evaluation of AI-augmented cognitive capabilities in educational and professional contexts. |
Date: | 2025–02 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2503.16438 |
By: | Sugat Chaturvedi; Rochana Chaturvedi |
Abstract: | Generative artificial intelligence (AI), particularly large language models (LLMs), is being rapidly deployed in recruitment and for candidate shortlisting. We audit several mid-sized open-source LLMs for gender bias using a dataset of 332,044 real-world online job postings. For each posting, we prompt the model to recommend whether an equally qualified male or female candidate should receive an interview callback. We find that most models tend to favor men, especially for higher-wage roles. Mapping job descriptions to the Standard Occupational Classification system, we find lower callback rates for women in male-dominated occupations and higher rates in female-associated ones, indicating occupational segregation. A comprehensive analysis of linguistic features in job ads reveals strong alignment of model recommendations with traditional gender stereotypes. To examine the role of recruiter identity, we steer model behavior by infusing Big Five personality traits and simulating the perspectives of historical figures. We find that less agreeable personas reduce stereotyping, consistent with an agreeableness bias in LLMs. Our findings highlight how AI-driven hiring may perpetuate biases in the labor market and have implications for fairness and diversity within firms. |
Date: | 2025–04 |
URL: | https://d.repec.org/n?u=RePEc:arx:papers:2504.21400 |
By: | Lars Hornuf; David J. Streich; Niklas Töllich |
Abstract: | Retrieval-augmented generation (RAG) has emerged as a promising way to improve task-specific performance in generative artificial intelligence (GenAI) applications such as large language models (LLMs). In this study, we evaluate the performance implications of providing various types of domain-specific information to LLMs in a simple portfolio allocation task. We compare the recommendations of seven state-of-the-art LLMs in various experimental conditions against a benchmark of professional financial advisors. Our main result is that the provision of domain-specific information does not unambiguously improve the quality of recommendations. In particular, we find that LLM recommendations underperform recommendations by human financial advisors in the baseline condition. However, providing firm-specific information improves historical performance in LLM portfolios and closes the gap with human advisors. Performance improvements are achieved through higher exposure to market risk and not through an increase in mean-variance efficiency within the risky portfolio share. Notably, portfolio risk increases primarily for risk-averse investors. We also document that quantitative firm-specific information affects recommendations more than qualitative firm-specific information, and that equipping models with generic finance theory does not affect recommendations. |
Keywords: | generative artificial intelligence, large language models, domain-specific information, retrieval-augmented generation, portfolio management, portfolio allocation. |
JEL: | G00 G11 G40 |
Date: | 2025 |
URL: | https://d.repec.org/n?u=RePEc:ces:ceswps:_11862 |
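The retrieval step in RAG can be sketched with TF-IDF cosine similarity over a toy document store (the snippets are invented; production systems typically use learned dense embeddings and a vector index):

```python
import math
from collections import Counter

# Toy document store (hypothetical firm-specific snippets, not the study's data).
docs = [
    "quarterly earnings rose and profit margins improved",
    "the central bank raised interest rates again",
    "dividend payout increased after strong earnings report",
]
query = "earnings and profit outlook"

def tf_idf_vectors(texts):
    tokenized = [t.split() for t in texts]
    n = len(tokenized)
    df = Counter(w for toks in tokenized for w in set(toks))
    idf = {w: math.log(n / df[w]) + 1.0 for w in df}    # smoothed IDF
    vecs = [{w: c * idf[w] for w, c in Counter(toks).items()} for toks in tokenized]
    return vecs, idf

def cosine(a, b):
    dot = sum(a[w] * b.get(w, 0.0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

vecs, idf = tf_idf_vectors(docs)
qv = {w: c * idf.get(w, 0.0) for w, c in Counter(query.split()).items()}
ranked = sorted(range(len(docs)), key=lambda i: cosine(qv, vecs[i]), reverse=True)
print(ranked[0])   # index of the snippet to place in the LLM's prompt
```

The top-ranked snippet is what gets appended to the prompt; the study's finding is that *which* such domain-specific material is retrieved (quantitative vs. qualitative, firm-specific vs. generic theory) matters for the resulting recommendations.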
By: | Jennifer Priefer (Paderborn University); Jan-Peter Kucklick (Paderborn University); Daniel Beverungen (Paderborn University); Oliver Müller (Paderborn University) |
Abstract: | Information systems have proven their value in facilitating pricing decisions. Still, predicting prices for complex goods, such as houses, remains challenging due to information asymmetries that obscure their qualities. Beyond search qualities that sellers can identify before a purchase, complex goods also possess experience qualities only identifiable ex-post. While research has discussed how information asymmetries cause market failure, it remains unclear how information systems can account for search and experience qualities of complex goods to enable their pricing in online markets. In a machine learning-based study, we quantify their predictive power for online real estate pricing, using geographic information systems and computer vision to incorporate spatial and image data into price prediction. We find that leveraging these secondary use data can transform some experience qualities into search qualities, increasing predictive power by up to 15.4%. We conclude that spatial and image data can provide valuable resources for improving price predictions for complex goods. |
Keywords: | information asymmetries; real estate appraisal; SEC theory; machine learning; geographic information systems; computer vision |
JEL: | C53 D82 R31 |
Date: | 2025–05 |
URL: | https://d.repec.org/n?u=RePEc:pdn:dispap:138 |
By: | Michaela Kecskésová (Department of Economics, Masaryk University, Lipová 41a, 60200 Brno, Czech Republic); Štěpán Mikula (Department of Economics, Masaryk University, Lipová 41a, 60200 Brno, Czech Republic) |
Abstract: | This paper investigates the effects of air pollution on public mood using sentiment analysis of geolocated social media data. Analyzing approximately 7 million Twitter posts from the United States in July 2015, we examine how fluctuations in air quality caused by Canadian wildfires influence sentiment. We find robust evidence that higher exposure to particulate matter leads to decreased positive sentiment and increased negative sentiment. Given the importance of mood as a factor in labor productivity, our results suggest that the short-term psychological effects of air pollution, alongside its well-documented physical health impacts, should be considered in policy discussions, as negative shifts in public mood due to poor air quality could have far-reaching economic consequences. |
Keywords: | air pollution; particulate matter; mood; sentiment analysis; Twitter; wildfires |
JEL: | Q5 D9 |
Date: | 2025–05 |
URL: | https://d.repec.org/n?u=RePEc:mub:wpaper:2025-05 |
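A deliberately minimal lexicon-based scorer shows the mechanics of sentiment analysis (hypothetical word lists; research like this typically uses trained models or published lexicons such as VADER):

```python
# Tiny hand-made lexicon (illustrative only).
POSITIVE = {"good", "great", "happy", "beautiful", "love", "clear"}
NEGATIVE = {"bad", "smoke", "haze", "terrible", "cough", "gray"}

def sentiment(text):
    """Net sentiment in [-1, 1]: (positives - negatives) / matched words."""
    words = text.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    return (pos - neg) / total if total else 0.0

clean_day = "beautiful clear morning love this weather"
smoky_day = "smoke and haze everywhere terrible air today"
print(sentiment(clean_day), sentiment(smoky_day))
```

Aggregating such per-post scores by location and day, and relating them to measured particulate concentrations, is the basic design the paper implements at the scale of millions of geolocated posts.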