nep-big New Economics Papers
on Big Data
Issue of 2024‒07‒29
twenty-one papers chosen by
Tom Coupé, University of Canterbury


  1. Developing an International Macroeconomic Forecasting Model Based on Big Data By Yoon, Sang-Ha
  2. Machine Learning for Economic Forecasting: An Application to China's GDP Growth By Yanqing Yang; Xingcheng Xu; Jinfeng Ge; Yan Xu
  3. Impact of Sentiment Analysis on Energy Sector Stock Prices: A FinBERT Approach By Sarra Ben Yahia; Jose Angel Garcia Sanchez; Rania Hentati Kaffel
  4. GraphCNNpred: A stock market indices prediction using a Graph based deep learning system By Yuhui Jin
  5. Investigating Factors Influencing Dietary Quality in China: Machine Learning Approaches By Feng, Yuan; Liu, Shuang; Zhang, Man; Jin, Yanhong; Yu, Xiaohua
  6. Using Machine Learning Method to Estimate the Heterogeneous Impacts of the Updated Nutrition Facts Panel By Zhang, Yuxiang; Liu, Yizao; Sears, James M.
  7. Evolution of Spatial Drivers for Oil Palm Expansion over Time: Insights from Spatiotemporal Data and Machine Learning Models By Zhao, Jing; Cochrane, Mark; Zhang, Xin; Elmore, Andrew; Lee, Janice; Su, Ye
  8. What Teaches Robots to Walk, Teaches Them to Trade too -- Regime Adaptive Execution using Informed Data and LLMs By Raeid Saqur
  9. F-FOMAML: GNN-Enhanced Meta-Learning for Peak Period Demand Forecasting with Proxy Data By Zexing Xu; Linjun Zhang; Sitan Yang; Rasoul Etesami; Hanghang Tong; Huan Zhang; Jiawei Han
  10. Big data in economics By Bogdan Oancea
  11. Predicting the Validity and Reliability of Survey Questions By Felderer, Barbara; Repke, Lydia; Weber, Wiebke; Schweisthal, Jonas; Bothmann, Ludwig
  12. Improving Realized LGD Approximation: A Novel Framework with XGBoost for Handling Missing Cash-Flow Data By Zuzanna Kostecka; Robert Ślepaczuk
  13. Statistical arbitrage in multi-pair trading strategy based on graph clustering algorithms in US equities market By Adam Korniejczuk; Robert Ślepaczuk
  14. LABOR-LLM: Language-Based Occupational Representations with Large Language Models By Tianyu Du; Ayush Kanodia; Herman Brunborg; Keyon Vafa; Susan Athey
  15. Breastfeeding and Child Development Outcomes across Early Childhood and Adolescence: Doubly Robust Estimation with Machine Learning By Khudri, Md Mohsan; Hussey, Andrew
  16. News Deja Vu: Connecting Past and Present with Semantic Search By Brevin Franklin; Emily Silcock; Abhishek Arora; Tom Bryan; Melissa Dell
  17. Tracking Trends in Topics of Agricultural and Applied Economics Discourse over the Last Century Using Natural Language Processing By Lee, Jacob W.; Elliott, Brendan; Lam, Aaron; Gupta, Neha; Wilson, Norbert L.W.; Collins, Leslie M.; Mainsah, Boyla
  18. Trading Devil: Robust backdoor attack via Stochastic investment models and Bayesian approach By Orson Mengara
  19. Impact of the Availability of ChatGPT on Software Development: A Synthetic Difference in Differences Estimation using GitHub Data By Alexander Quispe; Rodrigo Grijalba
  20. Alpha^2: Discovering Logical Formulaic Alphas using Deep Reinforcement Learning By Feng Xu; Yan Yin; Xinyu Zhang; Tianyuan Liu; Shengyi Jiang; Zongzhang Zhang
  21. Contrastive Entity Coreference and Disambiguation for Historical Texts By Abhishek Arora; Emily Silcock; Leander Heldring; Melissa Dell

  1. By: Yoon, Sang-Ha (KOREA INSTITUTE FOR INTERNATIONAL ECONOMIC POLICY (KIEP))
    Abstract: In the era of big data, economists are exploring new data sources and methodologies to improve economic forecasting. This study examines the potential of big data and machine learning in enhancing the predictive power of international macroeconomic forecasting models. The research utilizes both structured and unstructured data to forecast Korea's GDP growth rate. For structured data, around 200 macroeconomic and financial indicators from Korea and the U.S. were used with machine learning techniques (Random Forest, XGBoost, LSTM) and ensemble models. Results show that machine learning generally outperforms traditional econometric models, particularly for one-quarter-ahead forecasts, although performance varies by country and period. For unstructured data, the study uses Naver search data as a proxy for public sentiment. Using Dynamic Model Averaging and Selection (DMA and DMS) techniques, it incorporates eight Naver search indices alongside traditional macroeconomic variables. The findings suggest that online search data improves predictive power, especially in capturing economic turning points. The study also compares these big data-driven models with a Dynamic Stochastic General Equilibrium (DSGE) model. While DSGE offers policy analysis capabilities, its in-sample forecasts make direct comparison difficult. However, DMA and DMS models using search indices seem to better capture the GDP plunge in 2020. Based on the research findings, the author offers several suggestions to maximize the potential of big data. He stresses the importance of discovering and constructing diverse data sources, while also developing new analytical techniques such as machine learning. Furthermore, he suggests that big data models can be used as auxiliary indicators to complement existing forecasting models, and proposes that combining structural models with big data methodologies could create synergistic effects. Lastly, he argues that using text mining on various online sources to build comprehensive databases can secure richer and more timely economic data. These suggestions demonstrate the significant potential of big data in improving the accuracy of international macroeconomic forecasting, particularly emphasizing its effectiveness in situations where the economy is undergoing rapid changes.
    Keywords: International Macroeconomic Forecasting Model; Big Data
    Date: 2024–06–14
    URL: https://d.repec.org/n?u=RePEc:ris:kiepwe:2024_018
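    A minimal sketch of the kind of equal-weight ensemble forecast described in this abstract, assuming scikit-learn and synthetic placeholder data; GradientBoostingRegressor stands in for the XGBoost and LSTM learners named above, and nothing here reproduces the study's actual indicators or code.
      # Illustrative only: equal-weight ensemble of tree-based learners for
      # one-quarter-ahead GDP growth on synthetic stand-in data.
      import numpy as np
      from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

      rng = np.random.default_rng(0)
      X = rng.normal(size=(120, 20))                       # 120 quarters x 20 indicators (placeholder)
      y = 0.5 * X[:, 0] + rng.normal(scale=0.3, size=120)  # synthetic GDP growth rate

      X_train, y_train, X_test = X[:-4], y[:-4], X[-4:]    # hold out the last 4 quarters

      models = [
          RandomForestRegressor(n_estimators=500, random_state=0),
          GradientBoostingRegressor(random_state=0),       # stand-in for XGBoost / LSTM
      ]
      preds = np.column_stack([m.fit(X_train, y_train).predict(X_test) for m in models])
      print(preds.mean(axis=1))                            # equal-weight ensemble forecast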
  2. By: Yanqing Yang; Xingcheng Xu; Jinfeng Ge; Yan Xu
    Abstract: This paper aims to explore the application of machine learning in forecasting Chinese macroeconomic variables. Specifically, it employs various machine learning models to predict the quarterly real GDP growth of China, and analyzes the factors contributing to the performance differences among these models. Our findings indicate that the average forecast errors of machine learning models are generally lower than those of traditional econometric models or expert forecasts, particularly in periods of economic stability. However, during certain inflection points, although machine learning models still outperform traditional econometric models, expert forecasts may exhibit greater accuracy in some instances due to experts' more comprehensive understanding of the macroeconomic environment and real-time economic variables. In addition to macroeconomic forecasting, this paper employs interpretable machine learning methods to identify the key attributive variables from different machine learning models, aiming to enhance the understanding and evaluation of their contributions to macroeconomic fluctuations.
    Date: 2024–07
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2407.03595
  3. By: Sarra Ben Yahia (CES - Centre d'économie de la Sorbonne - UP1 - Université Paris 1 Panthéon-Sorbonne - CNRS - Centre National de la Recherche Scientifique); Jose Angel Garcia Sanchez (CES - Centre d'économie de la Sorbonne - UP1 - Université Paris 1 Panthéon-Sorbonne - CNRS - Centre National de la Recherche Scientifique); Rania Hentati Kaffel (CES - Centre d'économie de la Sorbonne - UP1 - Université Paris 1 Panthéon-Sorbonne - CNRS - Centre National de la Recherche Scientifique)
    Abstract: This study provides a sentiment analysis model to enhance market return forecasts by incorporating investor sentiment from social media platforms like Twitter (X). We leverage advanced NLP techniques and large language models to analyze sentiment from financial tweets. We use a large web-scraped dataset of selected energy stock daily returns spanning from 2018 to 2023. Sentiment scores derived from FinBERT are integrated into a novel predictive model (SIMDM) to evaluate autocorrelation structures within both the sentiment scores and stock returns data. Our findings reveal that (i) there are significant correlations between sentiment scores and stock prices, (ii) results are highly sensitive to data quality, and (iii) the study reinforces the concept of market efficiency and offers empirical evidence regarding the delayed influence of emotional states on stock returns.
    Keywords: financial NLP, FinBERT, information extraction, web scraping, sentiment analysis, LLM, deep learning
    Date: 2024–06–30
    URL: https://d.repec.org/n?u=RePEc:hal:cesptp:hal-04629569
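    A minimal sketch of the FinBERT scoring step mentioned in this abstract, assuming the Hugging Face transformers package and the public ProsusAI/finbert checkpoint; the authors' SIMDM model and tweet corpus are not reproduced.
      # Score a few example tweets with FinBERT and turn the label into a signed score.
      from transformers import pipeline

      finbert = pipeline("text-classification", model="ProsusAI/finbert")

      tweets = [
          "Oil majors beat earnings expectations this quarter.",
          "Refinery outage sparks fears of supply disruption.",
      ]
      for t in tweets:
          result = finbert(t)[0]                # label in {positive, negative, neutral}
          sign = {"positive": 1, "negative": -1, "neutral": 0}[result["label"]]
          print(t, result["label"], round(sign * result["score"], 3))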
  4. By: Yuhui Jin
    Abstract: Deep learning techniques for predicting stock market prices are a popular topic in the field of data science. Customized feature engineering typically serves as a pre-processing step for different stock market datasets. In this paper, we present a graph neural network based convolutional neural network (CNN) model that can be applied to diverse sources of data to extract features for predicting the trends of the S&P 500, NASDAQ, DJI, NYSE, and Russell indices.
    Date: 2024–07
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2407.03760
  5. By: Feng, Yuan; Liu, Shuang; Zhang, Man; Jin, Yanhong; Yu, Xiaohua
    Keywords: Food Consumption/Nutrition/Food Safety
    Date: 2024
    URL: https://d.repec.org/n?u=RePEc:ags:aaea22:343836
  6. By: Zhang, Yuxiang; Liu, Yizao; Sears, James M.
    Keywords: Food Consumption/Nutrition/Food Safety, Health Economics And Policy, Consumer/ Household Economics
    Date: 2024
    URL: https://d.repec.org/n?u=RePEc:ags:aaea22:343727
  7. By: Zhao, Jing; Cochrane, Mark; Zhang, Xin; Elmore, Andrew; Lee, Janice; Su, Ye
    Keywords: Land Economics/Use, Environmental Economics And Policy, Community/Rural/Urban Development
    Date: 2024
    URL: https://d.repec.org/n?u=RePEc:ags:aaea22:344016
  8. By: Raeid Saqur
    Abstract: Machine learning techniques applied to the problem of financial market forecasting struggle with dynamic regime switching, or underlying correlation and covariance shifts in the true (hidden) market variables. Drawing inspiration from the success of reinforcement learning in robotics, particularly in the agile locomotion adaptation of quadruped robots to unseen terrains, we introduce an innovative approach that leverages the world knowledge of pretrained LLMs (a.k.a. 'privileged information' in robotics) and dynamically adapts them using intrinsic, natural market rewards via an LLM alignment technique we dub "Reinforcement Learning from Market Feedback" (RLMF). Strong empirical results demonstrate the efficacy of our method in adapting to regime shifts in financial markets, a challenge that has long plagued predictive models in this domain. The proposed algorithmic framework outperforms the best-performing SOTA LLM models on the existing (FLARE) benchmark stock-movement (SM) tasks by more than 15% in accuracy. On the recently proposed NIFTY SM task, our adaptive policy outperforms the best-performing SOTA trillion-parameter models such as GPT-4. The paper details the dual-phase, teacher-student architecture and implementation of our model, the empirical results obtained, and an analysis of the role of language embeddings in terms of information gain.
    Date: 2024–06
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2406.15508
  9. By: Zexing Xu; Linjun Zhang; Sitan Yang; Rasoul Etesami; Hanghang Tong; Huan Zhang; Jiawei Han
    Abstract: Demand prediction is a crucial task for e-commerce and physical retail businesses, especially during high-stakes sales events. However, the limited availability of historical data from these peak periods poses a significant challenge for traditional forecasting methods. In this paper, we propose a novel approach that leverages strategically chosen proxy data reflective of potential sales patterns from similar entities during non-peak periods, enriched by features learned from a graph neural network (GNN)-based forecasting model, to predict demand during peak events. We formulate demand prediction as a meta-learning problem and develop the Feature-based First-Order Model-Agnostic Meta-Learning (F-FOMAML) algorithm that leverages proxy data from non-peak periods and GNN-generated relational metadata to learn feature-specific layer parameters, thereby adapting to demand forecasts for peak events. Theoretically, we show that by considering domain similarities through task-specific metadata, our model achieves improved generalization, where the excess risk decreases as the number of training tasks increases. Empirical evaluations on large-scale industrial datasets demonstrate the superiority of our approach. Compared to existing state-of-the-art models, our method demonstrates a notable improvement in demand prediction accuracy, reducing the Mean Absolute Error by 26.24% on an internal vending machine dataset and by 1.04% on the publicly accessible JD.com dataset.
    Date: 2024–06
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2406.16221
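    A schematic first-order MAML (FOMAML) loop on toy linear-regression tasks, to make the meta-learning idea concrete; the paper's GNN-derived task metadata and feature-specific layers are omitted, so this is only a generic sketch.
      import numpy as np

      rng = np.random.default_rng(1)
      theta = np.zeros(5)                  # shared meta-parameters of a linear model
      alpha, beta = 0.05, 0.01             # inner / outer learning rates

      def make_batch(w_true):
          X = rng.normal(size=(32, 5))
          return X, X @ w_true + rng.normal(scale=0.1, size=32)

      def grad(w, X, y):                   # gradient of mean squared error
          return 2 * X.T @ (X @ w - y) / len(y)

      for _ in range(200):                 # meta-training over sampled "proxy" tasks
          w_task = rng.normal(size=5)
          Xs, ys = make_batch(w_task)      # support set: one inner adaptation step
          adapted = theta - alpha * grad(theta, Xs, ys)
          Xq, yq = make_batch(w_task)      # query set: first-order outer update
          theta = theta - beta * grad(adapted, Xq, yq)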
  10. By: Bogdan Oancea
    Abstract: The term of big data was used since 1990s, but it became very popular around 2012. A recent definition of this term says that big data are information assets characterized by high volume, velocity, variety and veracity that need special analytical methods and software technologies to extract value form them. While big data was used at the beginning mostly in information technology field, now it can be found in every area of activity: in governmental decision-making processes, manufacturing, education, healthcare, economics, engineering, natural sciences, sociology. The rise of Internet, mobile phones, social media networks, different types of sensors or satellites provide enormous quantities of data that can have profound effects on economic research. The data revolution that we are facing transformed the way we measure the human behavior and economic activities. Unemployment, consumer price index, population mobility, financial transactions are only few examples of economic phenomena that can be analyzed using big data sources. In this paper we will start with a taxonomy of big data sources and show how these new data sources can be used in empirical analyses and to build economic indicators very fast and with reduced costs.
    Date: 2024–06
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2406.11913
  11. By: Felderer, Barbara; Repke, Lydia; Weber, Wiebke; Schweisthal, Jonas; Bothmann, Ludwig
    Abstract: The Survey Quality Predictor (SQP) is an open-access system to predict the quality, i.e., the reliability and validity, of survey questions based on the characteristics of the questions. The prediction is based on a meta-regression of many multitrait-multimethod (MTMM) experiments in which characteristics of the survey questions were systematically varied. The release of SQP 3.0, which is based on an expanded database compared to previous SQP versions, raised the need for a new meta-regression. To find the best method for analyzing the complex data structure of SQP (e.g., the existence of various uncorrelated predictors), we compared four suitable machine learning methods in terms of their ability to predict both survey quality indicators: LASSO, elastic net, boosting, and random forest. The article discusses the performance of the models and illustrates the importance of the individual item characteristics in the random forest model, which was chosen for SQP 3.0.
    Date: 2024–06–27
    URL: https://d.repec.org/n?u=RePEc:osf:osfxxx:hkngd
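    A compact sketch of the four-way model comparison described in this abstract, assuming scikit-learn and random placeholder data in place of the SQP/MTMM database.
      import numpy as np
      from sklearn.linear_model import Lasso, ElasticNet
      from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
      from sklearn.model_selection import cross_val_score

      rng = np.random.default_rng(0)
      X = rng.normal(size=(300, 40))                              # question characteristics (placeholder)
      y = X[:, :3].sum(axis=1) + rng.normal(scale=0.5, size=300)  # e.g. reliability to be predicted

      models = {
          "LASSO": Lasso(alpha=0.1),
          "elastic net": ElasticNet(alpha=0.1, l1_ratio=0.5),
          "boosting": GradientBoostingRegressor(random_state=0),
          "random forest": RandomForestRegressor(n_estimators=300, random_state=0),
      }
      for name, m in models.items():
          print(name, cross_val_score(m, X, y, cv=5, scoring="r2").mean().round(3))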
  12. By: Zuzanna Kostecka; Robert Ślepaczuk
    Abstract: Accurate calculation of the Loss Given Default (LGD) parameter requires comprehensive financial data. In this research, we aim to explore methods for improving the approximation of realized LGD under conditions of limited access to cash-flow data. We enhance the performance of the method that relies on differences between exposure values (the delta outstanding approach) by employing machine learning (ML) techniques. The research utilizes data from the mortgage portfolio of one of the European countries and assumes a close resemblance to similar economic contexts. It incorporates non-financial variables and macroeconomic data related to the housing market, improving the accuracy of loss severity approximation. The proposed methodology attempts to mitigate country-specific (related to local legal frameworks) or portfolio-specific factors in order to show the general advantage of applying ML techniques rather than a case-specific relation. We developed an XGBoost model that does not rely on cash-flow data yet enhances the accuracy of realized LGD estimation compared to results obtained with the delta outstanding approach. A novel aspect of our work is the detailed exploration of the delta outstanding approach and the methodology for addressing conditions of limited access to cash-flow data through machine learning models.
    Date: 2024–06
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2406.17308
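    A hedged sketch of an XGBoost regressor for realized LGD approximation, assuming the xgboost package and synthetic stand-in features (exposure delta, loan-to-value, a housing-market variable); this illustrates the modelling idea only and is not the authors' specification.
      import numpy as np
      from xgboost import XGBRegressor

      rng = np.random.default_rng(0)
      n = 5000
      X = np.column_stack([
          rng.normal(size=n),               # delta outstanding (relative exposure change)
          rng.uniform(0, 2, size=n),        # loan-to-value proxy
          rng.normal(size=n),               # regional house-price index change
      ])
      lgd = np.clip(0.4 + 0.2 * X[:, 1] - 0.1 * X[:, 2] + rng.normal(scale=0.1, size=n), 0, 1)

      model = XGBRegressor(n_estimators=400, max_depth=4, learning_rate=0.05)
      model.fit(X[:4000], lgd[:4000])
      pred = np.clip(model.predict(X[4000:]), 0, 1)   # keep predictions inside the [0, 1] LGD range
      print(pred[:5])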
  13. By: Adam Korniejczuk; Robert Ślepaczuk
    Abstract: The study seeks to develop an effective trading strategy within a novel framework of statistical arbitrage based on graph clustering algorithms. An amalgamation of quantitative and machine learning methods, including the Kelly criterion and an ensemble of machine learning classifiers, has been used to improve risk-adjusted returns and increase immunity to transaction costs over existing approaches. The study seeks to provide an integrated approach to optimal signal detection and risk management. As part of this approach, innovative ways of optimizing take-profit and stop-loss functions for daily-frequency trading strategies have been proposed and tested. All of the tested approaches outperformed appropriate benchmarks. The best combinations of techniques and parameters demonstrated significantly better performance metrics than the relevant benchmarks. The results have been obtained under the assumption of realistic transaction costs, but are sensitive to changes in some key parameters.
    Date: 2024–06
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2406.10695
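    An illustrative Kelly-criterion position-sizing helper, one ingredient named in this abstract; the graph-clustering pair selection, classifier ensemble, and take-profit/stop-loss optimization are not shown.
      def kelly_binary(p_win: float, win_loss_ratio: float) -> float:
          """Kelly fraction for a bet winning `win_loss_ratio` units per unit risked."""
          return p_win - (1.0 - p_win) / win_loss_ratio

      def kelly_continuous(mean_return: float, var_return: float) -> float:
          """Gaussian approximation: optimal fraction of capital ~ mu / sigma^2."""
          return mean_return / var_return

      # Example: a classifier signals a 55% win probability with a 1.2 payoff ratio.
      print(round(kelly_binary(0.55, 1.2), 3))            # 0.175 of capital
      print(round(kelly_continuous(4e-4, 1e-4), 1))       # 4.0 (would be scaled down in practice)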
  14. By: Tianyu Du; Ayush Kanodia; Herman Brunborg; Keyon Vafa; Susan Athey
    Abstract: Many empirical studies of labor market questions rely on estimating relatively simple predictive models using small, carefully constructed longitudinal survey datasets based on hand-engineered features. Large Language Models (LLMs), trained on massive datasets, encode vast quantities of world knowledge and can be used for the next job prediction problem. However, while an off-the-shelf LLM produces plausible career trajectories when prompted, the probability with which an LLM predicts a particular job transition conditional on career history will not, in general, align with the true conditional probability in a given population. Recently, Vafa et al. (2024) introduced a transformer-based "foundation model", CAREER, trained using a large, unrepresentative resume dataset, that predicts transitions between jobs; it further demonstrated how transfer learning techniques can be used to leverage the foundation model to build better predictive models of both transitions and wages that reflect conditional transition probabilities found in nationally representative survey datasets. This paper considers an alternative where the fine-tuning of the CAREER foundation model is replaced by fine-tuning LLMs. For the task of next job prediction, we demonstrate that models trained with our approach outperform several alternatives in terms of predictive performance on the survey data, including traditional econometric models, CAREER, and LLMs with in-context learning, even though the LLM can in principle predict job titles that are not allowed in the survey data. Further, we show that our fine-tuned LLM-based models' predictions are more representative of the career trajectories of various workforce subpopulations than off-the-shelf LLM models and CAREER. We conduct experiments and analyses that highlight the sources of the gains in the performance of our models for representative predictions.
    Date: 2024–06
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2406.17972
  15. By: Khudri, Md Mohsan (Austin Community College); Hussey, Andrew (University of Memphis)
    Abstract: Using data from the Panel Study of Income Dynamics, we estimate the impact of breastfeeding initiation and duration on multiple cognitive, health, and behavioral outcomes spanning early childhood through adolescence. To mitigate the potential bias from misspecification, we employ a doubly robust (DR) estimation method, addressing misspecification in either the treatment or outcome models while adjusting for selection effects. Our novel approach is to use and evaluate a battery of supervised machine learning (ML) algorithms to improve propensity score (PS) estimates. We demonstrate that the gradient boosting machine (GBM) algorithm removes bias more effectively and minimizes other prediction errors compared to logit and probit models as well as alternative ML algorithms. Across all outcomes, our DR-GBM estimation generally yields lower estimates than OLS, DR, and PS matching using standard and alternative ML algorithms and even sibling fixed effects estimates. We find that having been breastfed is significantly linked to multiple improved early cognitive outcomes, though the impact reduces somewhat with age. In contrast, we find mixed evidence regarding the impact of breastfeeding on non-cognitive (health and behavioral) outcomes, with effects being most pronounced in adolescence. Our results also suggest relatively higher cognitive benefits for children of minority mothers and children of mothers with at least some post-high school education, and minimal marginal benefits of breastfeeding duration beyond 12 months for cognitive outcomes and 6 months for non-cognitive outcomes.
    Keywords: breastfeeding, human capital, cognitive and non-cognitive outcomes, doubly robust estimation, machine learning
    JEL: I12 I18 J13 J24 C21 C63
    Date: 2024–06
    URL: https://d.repec.org/n?u=RePEc:iza:izadps:dp17080
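    A compact doubly robust (AIPW) sketch with gradient-boosting propensity scores, in the spirit of the DR estimation with ML propensity scores described above; synthetic data stands in for the PSID variables, and scikit-learn's GradientBoostingClassifier stands in for GBM.
      import numpy as np
      from sklearn.ensemble import GradientBoostingClassifier
      from sklearn.linear_model import LinearRegression

      rng = np.random.default_rng(0)
      n = 4000
      X = rng.normal(size=(n, 6))                      # maternal/household covariates (placeholder)
      T = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))  # treatment: breastfeeding initiation
      Y = 0.3 * T + X[:, 1] + rng.normal(size=n)       # outcome with a true effect of 0.3

      e = np.clip(GradientBoostingClassifier(random_state=0)
                  .fit(X, T).predict_proba(X)[:, 1], 0.01, 0.99)   # trimmed propensity scores
      mu1 = LinearRegression().fit(X[T == 1], Y[T == 1]).predict(X)
      mu0 = LinearRegression().fit(X[T == 0], Y[T == 0]).predict(X)

      ate = np.mean(mu1 - mu0 + T * (Y - mu1) / e - (1 - T) * (Y - mu0) / (1 - e))
      print(round(ate, 3))                             # close to the true effect 0.3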
  16. By: Brevin Franklin; Emily Silcock; Abhishek Arora; Tom Bryan; Melissa Dell
    Abstract: Social scientists and the general public often analyze contemporary events by drawing parallels with the past, a process complicated by the vast, noisy, and unstructured nature of historical texts. For example, hundreds of millions of page scans from historical newspapers have been noisily transcribed. Traditional sparse methods for searching for relevant material in these vast corpora, e.g., with keywords, can be brittle given complex vocabularies and OCR noise. This study introduces News Deja Vu, a novel semantic search tool that leverages transformer large language models and a bi-encoder approach to identify historical news articles that are most similar to modern news queries. News Deja Vu first recognizes and masks entities, in order to focus on broader parallels rather than the specific named entities being discussed. Then, a contrastively trained, lightweight bi-encoder retrieves historical articles that are most similar semantically to a modern query, illustrating how phenomena that might seem unique to the present have varied historical precedents. Aimed at social scientists, the user-friendly News Deja Vu package is designed to be accessible for those who lack extensive familiarity with deep learning. It works with large text datasets, and we show how it can be deployed to a massive scale corpus of historical, open-source news articles. While human expertise remains important for drawing deeper insights, News Deja Vu provides a powerful tool for exploring parallels in how people have perceived past and present.
    Date: 2024–06
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2406.15593
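    A small sketch of the retrieval step, assuming the sentence-transformers package and its generic all-MiniLM-L6-v2 model; the authors' entity-masking pipeline and contrastively trained historical encoder are replaced by toy placeholders here.
      from sentence_transformers import SentenceTransformer, util

      encoder = SentenceTransformer("all-MiniLM-L6-v2")   # generic bi-encoder, not News Deja Vu's

      historical = [
          "[MASK] warns of rising grain prices after a poor harvest.",
          "[MASK] announces quarantine measures as the epidemic spreads.",
      ]
      query = "[MASK] imposes a lockdown to contain the outbreak."   # modern article, entities masked

      doc_emb = encoder.encode(historical, convert_to_tensor=True)
      q_emb = encoder.encode(query, convert_to_tensor=True)
      scores = util.cos_sim(q_emb, doc_emb)[0]
      best = int(scores.argmax())
      print(historical[best], float(scores[best]))        # most semantically similar past article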
  17. By: Lee, Jacob W.; Elliott, Brendan; Lam, Aaron; Gupta, Neha; Wilson, Norbert L.W.; Collins, Leslie M.; Mainsah, Boyla
    Keywords: Research Methods/Statistical Methods, Agricultural And Food Policy, Teaching/Communication/Extension/Profession
    Date: 2024
    URL: https://d.repec.org/n?u=RePEc:ags:aaea22:343814
  18. By: Orson Mengara
    Abstract: With the growing use of voice-activated systems and speech recognition technologies, the danger of backdoor attacks on audio data has grown significantly. This research looks at a specific type of attack, known as a stochastic investment-based backdoor attack (MarketBack), in which adversaries strategically manipulate the stylistic properties of audio to fool speech recognition systems. Backdoor attacks seriously threaten the security and integrity of machine learning models; to maintain the reliability of audio applications and systems, identifying such attacks becomes crucial in the context of audio data. Experimental results demonstrate that MarketBack can achieve an average attack success rate close to 100% across seven victim models when poisoning less than 1% of the training data.
    Date: 2024–06
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2406.10719
  19. By: Alexander Quispe; Rodrigo Grijalba
    Abstract: Advancements in Artificial Intelligence, particularly with ChatGPT, have significantly impacted software development. Utilizing novel data from the GitHub Innovation Graph, we hypothesize that ChatGPT enhances software production efficiency. Exploiting natural experiments in which some governments banned ChatGPT, we employ Difference-in-Differences (DID), Synthetic Control (SC), and Synthetic Difference-in-Differences (SDID) methods to estimate its effects. Our findings indicate a significant positive impact on the number of git pushes, repositories, and unique developers per 100,000 people, particularly for high-level, general-purpose, and shell scripting languages. These results suggest that AI tools like ChatGPT can substantially boost developer productivity, though further analysis is needed to address potential downsides such as low-quality code and privacy concerns.
    Date: 2024–06
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2406.11046
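    A minimal two-way fixed effects difference-in-differences sketch with statsmodels on fabricated panel data; the paper's Synthetic Control and SDID estimators and the GitHub Innovation Graph data are not reproduced here.
      import numpy as np
      import pandas as pd
      import statsmodels.formula.api as smf

      rng = np.random.default_rng(0)
      df = pd.DataFrame([{"unit": u, "t": t,
                          "banned": int(u < 10),          # first 10 "countries" ban ChatGPT
                          "post": int(t >= 6)}
                         for u in range(30) for t in range(12)])
      df["treat_post"] = df["banned"] * df["post"]
      # fabricated outcome: git pushes per 100,000 people, with a -5 effect of the ban
      df["pushes"] = (50 + 2 * df["t"] + 3 * df["banned"] - 5 * df["treat_post"]
                      + rng.normal(scale=2, size=len(df)))

      did = smf.ols("pushes ~ treat_post + C(unit) + C(t)", data=df).fit()
      print(did.params["treat_post"])                     # DID estimate of the ban's effect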
  20. By: Feng Xu; Yan Yin; Xinyu Zhang; Tianyuan Liu; Shengyi Jiang; Zongzhang Zhang
    Abstract: Alphas are pivotal in providing signals for quantitative trading. The industry highly values the discovery of formulaic alphas for their interpretability and ease of analysis, compared with the expressive yet overfitting-prone black-box alphas. In this work, we focus on discovering formulaic alphas. Prior studies on automatically generating collections of formulaic alphas were mostly based on genetic programming (GP), which is known to suffer from sensitivity to the initial population, convergence to local optima, and slow computation speed. Recent efforts employing deep reinforcement learning (DRL) for alpha discovery have not fully addressed key practical considerations such as alpha correlations and validity, which are crucial for their effectiveness. In this work, we propose a novel framework for alpha discovery using DRL by formulating the alpha discovery process as program construction. Our agent, Alpha^2, assembles an alpha program optimized for an evaluation metric. A search algorithm guided by DRL navigates through the search space based on value estimates for potential alpha outcomes. The evaluation metric encourages both the performance and the diversity of alphas for a better final trading strategy. Our formulation of searching alphas also brings the advantage of pre-calculation dimensional analysis, ensuring the logical soundness of alphas and pruning the vast search space to a large extent. Empirical experiments on real-world stock markets demonstrate Alpha^2's capability to identify a diverse set of logical and effective alphas, which significantly improves the performance of the final trading strategy. The code of our method is available at https://github.com/x35f/alpha2.
    Date: 2024–06
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2406.16505
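    A toy illustration of evaluating one formulaic alpha on a small price panel with pandas; the paper's DRL search over alpha programs and its dimensional-analysis pruning are not reproduced, only the "alpha as formula" idea.
      import numpy as np
      import pandas as pd

      rng = np.random.default_rng(0)
      dates = pd.date_range("2024-01-01", periods=60, freq="B")
      prices = pd.DataFrame(
          100 * np.exp(np.cumsum(rng.normal(0, 0.01, size=(60, 4)), axis=0)),
          index=dates, columns=["AAA", "BBB", "CCC", "DDD"])

      # Example formulaic alpha: cross-sectional rank of 5-day price momentum.
      momentum = prices / prices.shift(5) - 1
      alpha = momentum.rank(axis=1, pct=True) - 0.5       # centered ranks as portfolio weights

      # Crude evaluation metric: mean information coefficient vs. next-day returns.
      next_ret = prices.pct_change().shift(-1)
      print(round(alpha.corrwith(next_ret, axis=1).mean(), 4))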
  21. By: Abhishek Arora; Emily Silcock; Leander Heldring; Melissa Dell
    Abstract: Massive-scale historical document collections are crucial for social science research. Despite increasing digitization, these documents typically lack unique cross-document identifiers for individuals mentioned within the texts, as well as individual identifiers from external knowledgebases like Wikipedia/Wikidata. Existing entity disambiguation methods often fall short in accuracy for historical documents, which are replete with individuals not remembered in contemporary knowledgebases. This study makes three key contributions to improve cross-document coreference resolution and disambiguation in historical texts: a massive-scale training dataset replete with hard negatives, which sources over 190 million entity pairs from Wikipedia contexts and disambiguation pages; high-quality evaluation data from hand-labeled historical newswire articles; and trained models evaluated on this historical benchmark. We contrastively train bi-encoder models for coreferencing and disambiguating individuals in historical texts, achieving accurate, scalable performance that identifies out-of-knowledgebase individuals. Our approach significantly surpasses other entity disambiguation models on our historical newswire benchmark. Our models also demonstrate competitive performance on modern entity disambiguation benchmarks, particularly certain news disambiguation datasets.
    Date: 2024–06
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2406.15576

This nep-big issue is ©2024 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at https://nep.repec.org. For comments please write to the director of NEP, Marco Novarese at <director@nep.repec.org>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.