|
on Big Data |
By: | João Amador; Paulo Barbosa; João Cortes |
Abstract: | This paper studies firms’ distances to becoming successful exporters. The empirical exercise uses rich data on Portuguese firms and assumes that there are significant features distinguishing exporters from non-exporters. An array of machine learning models—Bayesian Additive Regression Tree (BART), Missingness Not at Random (BART-MIA), Random Forest, Logit Regression, and Neural Networks—is trained to predict firms’ export probability and to shed light on the critical factors driving the transition to successful export ventures. Neural Networks outperform the other models and remain highly accurate when export definitions and training and testing strategies are changed. We show that the most influential variables for prediction are labor productivity and the share of imports from the EU in total purchases. Additionally, firms at the median distance to selling in international markets operate with about twice the assets of the group in the decile most distant from exporting. Firms in the decile closest to the export market operate with around 12 times more assets than those in the decile most distant from exporting. |
JEL: | C53 C55 L2 |
Date: | 2024 |
URL: | https://d.repec.org/n?u=RePEc:ptu:wpaper:w202420 |
By: | Chenyu Hou (Simon Fraser University) |
Abstract: | Most empirical studies on expectation formation models share a common dynamic structure but impose different functional form restrictions. I propose a flexible non-parametric method that maintains this dynamic structure to estimate a model of expectation formation using Recurrent Neural Networks. Applying this approach to data on macroeconomic expectations from the Michigan Survey of Consumers and a rich set of signals, I document three novel findings: (1) agents’ expectations about the future economic condition have asymmetric and non-linear responses to signals; (2) agents’ attention shifts from signals about the current state to signals about the future as the economic condition deteriorates; (3) the content of signals on economic conditions plays the most important role in creating the attention shift. A Double Machine Learning approach is used to obtain statistical inference on these empirical findings. Finally, I show that these stylized facts can be generated by a model with rational inattention, in which information endogenously becomes more valuable when economic status worsens. |
Date: | 2023–04 |
URL: | https://d.repec.org/n?u=RePEc:sfu:sfudps:dp23-13 |
By: | Ashwin, Julian; Chhabra, Aditya; Rao, Vijayendra |
Abstract: | Large Language Models (LLMs) are quickly becoming ubiquitous, but the implications for social science research are not yet well understood. This paper asks whether LLMs can help us analyse large-N qualitative data from open-ended interviews, with an application to transcripts of interviews with displaced Rohingya people in Cox’s Bazaar, Bangladesh. The analysis finds that a great deal of caution is needed in using LLMs to annotate text, as there is a risk of introducing biases that can lead to misleading inferences. Here this refers to bias in the technical sense: the errors that LLMs make in annotating interview transcripts are not random with respect to the characteristics of the interview subjects. Training simpler supervised models on high-quality human annotations with flexible coding leads to less measurement error and bias than LLM annotations. Therefore, given that some high-quality annotations are necessary in order to assess whether an LLM introduces bias, this paper argues that it is probably preferable to train a bespoke model on these annotations than it is to use an LLM for annotation. |
Date: | 2023–11–07 |
URL: | https://d.repec.org/n?u=RePEc:wbk:wbrwps:10597 |
By: | Letta, Marco; Montalbano, Pierluigi; Paolantonio, Adriana |
Abstract: | The complex relationship between climate shocks, migration, and adaptation hampers a rigorous understanding of the heterogeneous mobility outcomes of farm households exposed to climate risk. To unpack this heterogeneity, the analysis combines longitudinal multi-topic household survey data from Nigeria with a causal machine learning approach, tailored to a conceptual framework bridging economic migration theory and the poverty traps literature. The results show that pre-shock asset levels, in situ adaptive capacity, and cumulative shock exposure drive not just the magnitude but also the sign of the impact of agriculture-relevant weather anomalies on the mobility outcomes of farming households. While local adaptation acts as a substitute for migration, the roles played by wealth constraints and repeated shock exposure suggest the presence of climate-induced immobility traps. |
Date: | 2024–03–18 |
URL: | https://d.repec.org/n?u=RePEc:wbk:wbrwps:10724 |
By: | Reiss, Michael V.; Roggenkamp, Hauke |
Abstract: | We examine the reproducibility and robustness of the central claims from Robertson et al. (2023), who investigate the impact of negative language on online news consumption by analyzing over 12,448 randomized controlled trials on upworthy.com. Applying "lexical" sentiment analyses, the authors make two central claims: first, they find that headlines with negative words significantly increase click-through rates (CTR). Second, they find that positive words in a headline reduce a news headline's CTR. Our reproducibility efforts include two different techniques: using the same data and procedures described in the study, we successfully reproduce the two claims through a blind computational approach, with only minor and inconsequential discrepancies. When using the authors' codes, we reproduce the two claims with identical numerical results. Examining the robustness of the authors' claims in a pre-registered third step, we validate and apply a "semantic" sentiment analysis using two large language models to re-compute their independent variables describing negativity and positivity. While we find support for the negativity bias, we do not find semantic (in contrast to lexical) positivity to reduce online news consumption. |
Date: | 2025 |
URL: | https://d.repec.org/n?u=RePEc:zbw:i4rdps:199 |
By: | von der Heyde, Leah (LMU Munich); Haensch, Anna-Carolina; Wenz, Alexander (University of Mannheim) |
Abstract: | UPDATED VERSION AT https://arxiv.org/abs/2407.08563 The recent development of large language models (LLMs) has spurred discussions about whether LLM-generated “synthetic samples” could complement or replace traditional surveys, considering their training data potentially reflects attitudes and behaviors prevalent in the population. A number of mostly US-based studies have prompted LLMs to mimic survey respondents, finding that the responses closely match the survey data. However, several contextual factors related to the relationship between the respective target population and LLM training data might affect the generalizability of such findings. In this study, we investigate the extent to which LLMs can estimate public opinion in Germany, using the example of vote choice as the outcome of interest. To generate a synthetic sample of eligible voters in Germany, we create personas matching the individual characteristics of the 2017 German Longitudinal Election Study respondents. Prompting GPT-3 with each persona, we ask the LLM to predict each respondent’s vote choice in the 2017 German federal elections and compare these predictions to the survey-based estimates on the aggregate and subgroup levels. We find that GPT-3 does not predict citizens’ vote choice accurately, exhibiting a bias towards the Green and Left parties, and making better predictions for more “typical” voter subgroups. While the language model is able to capture broad-brush tendencies tied to partisanship, it tends to miss out on the multifaceted factors that sway individual voter choices. Furthermore, our results suggest that GPT-3 might not be reliable for estimating nuanced, subgroup-specific political attitudes. By examining the prediction of voting behavior using LLMs in a new context, our study contributes to the growing body of research about the conditions under which LLMs can be leveraged for studying public opinion. The findings point to disparities in opinion representation in LLMs and underscore the limitation of applying them for public opinion estimation without accounting for the biases in their training data. |
Date: | 2023–12–15 |
URL: | https://d.repec.org/n?u=RePEc:osf:socarx:8je9g_v1 |
By: | Mehdi Mikou (ESE - Ecologie Systématique et Evolution - AgroParisTech - Université Paris-Saclay - CNRS - Centre National de la Recherche Scientifique, CIRED - Centre International de Recherche sur l'Environnement et le Développement - Cirad - Centre de Coopération Internationale en Recherche Agronomique pour le Développement - EHESS - École des hautes études en sciences sociales - AgroParisTech - ENPC - École nationale des ponts et chaussées - Université Paris-Saclay - CNRS - Centre National de la Recherche Scientifique); Améline Vallet (ESE - Ecologie Systématique et Evolution - AgroParisTech - Université Paris-Saclay - CNRS - Centre National de la Recherche Scientifique, CIRED - Centre International de Recherche sur l'Environnement et le Développement - Cirad - Centre de Coopération Internationale en Recherche Agronomique pour le Développement - EHESS - École des hautes études en sciences sociales - AgroParisTech - ENPC - École nationale des ponts et chaussées - Université Paris-Saclay - CNRS - Centre National de la Recherche Scientifique); Céline Guivarch (CIRED - Centre International de Recherche sur l'Environnement et le Développement - Cirad - Centre de Coopération Internationale en Recherche Agronomique pour le Développement - EHESS - École des hautes études en sciences sociales - AgroParisTech - ENPC - École nationale des ponts et chaussées - Université Paris-Saclay - CNRS - Centre National de la Recherche Scientifique); David Makowski (MIA Paris-Saclay - Mathématiques et Informatique Appliquées - AgroParisTech - Université Paris-Saclay - INRAE - Institut National de Recherche pour l’Agriculture, l’Alimentation et l’Environnement) |
Abstract: | Income maps have been extensively used for identifying populations vulnerable to global changes. The frequency and intensity of extreme events are likely to increase in coming years as a result of climate change. In this context, several studies have hypothesized that the economic and social impacts of extreme events depend on income. However, to rigorously test this hypothesis, fine‐scale spatial income data compatible with the analysis of extreme climatic events are needed. To produce reliable high‐resolution income data, we have developed an innovative machine learning framework that we applied to produce a European 1 km‐gridded data set of per capita disposable income for 2015. This data set was generated by downscaling income data available for more than 120,000 administrative units. Our learning framework showed high accuracy levels and performed as well as or better than other existing approaches used in the literature for downscaling income. It also yielded better results for the estimation of spatial inequality within administrative units. Using SHAP values, we explored the contribution of the model predictors to income predictions and found that, in addition to geographic predictors, distance to public transport and nighttime light intensity were key drivers of income predictions. More broadly, this data set offers an opportunity to explore the relationships between economic inequality and environmental degradation in health, adaptation or urban planning sectors. It can also facilitate the development of future income maps that align with the Shared Socioeconomic Pathways, and ultimately enable the assessment of future climate risks. |
Keywords: | machine learning, random forest, income, Europe, spatial modeling, economic vulnerability |
Date: | 2025 |
URL: | https://d.repec.org/n?u=RePEc:hal:journl:hal-04906700 |
By: | Churchill, Alexander; Pichika, Shamitha; Xu, Chengxin (Seattle University); Liu, Ying (Rutgers University) |
Abstract: | Text annotation, the practice of labeling text following a predetermined scheme, is essential to qualitative research in public policy. Despite its importance and utility, text annotation for policy research faces challenges of high labor and time costs, particularly when the size of the qualitative data is enormous. Recent developments in large language models (LLMs), specifically models with generative pre-trained transformers (GPTs), show a potential approach that may alleviate the burden of manual annotation coding. In this report, we test whether OpenAI’s GPT-3.5 and GPT-4 models can be employed for text annotation tasks and measure the results of different prompting strategies against manual annotation. Using email messages collected from a national correspondence experiment in the U.S. nursing home market as an example, we demonstrate, on average, 86.25% agreement between GPT and human annotations. We also show that GPT models possess context-based limitations. Our report ends with suggestions, guidance, and reflections for readers who are interested in using GPT models for text annotation. |
Date: | 2024–01–25 |
URL: | https://d.repec.org/n?u=RePEc:osf:socarx:6fpgj_v1 |
By: | Gil-Clavel, Sofia (Max Planck Institute for Demographic Research); Wagenblast, Thorid; Akkerman, Joos (Delft University of Technology); Filatova, Tatiana |
Abstract: | Understanding which climate change adaptation constraints manifest for different actors – governments, communities, individuals and households – is essential, as adaptation is turning into a matter of survival. Though rich qualitative research reveals constraints for diverse cases, methods to consolidate knowledge and elicit patterns in adaptation constraints for various actors and hazards are scarce. We fill this gap by analyzing associations between different adaptations and actors’ constraints in adaptation to climate-induced floods and sea-level rise. Our novel approach derives textual data from peer-reviewed articles (published before February 2024) by using natural language processing, supervised learning, thematic coding books, and network analysis. Results show that social capital, economic factors, and government support are constraints shared among all actors. With respect to adaptation types, communities are frequently associated with maladaptation, while individuals and households are frequently associated with transformational adaptation. |
Date: | 2024–04–26 |
URL: | https://d.repec.org/n?u=RePEc:osf:socarx:3cqvn_v1 |
By: | Menaka Hampole; Dimitris Papanikolaou; Lawrence D.W. Schmidt; Bryan Seegmiller |
Abstract: | We leverage recent advances in NLP to construct measures of workers' task exposure to AI and machine learning technologies over the 2010 to 2023 period that vary across firms and time. Using a theoretical framework that allows for a labor-saving technology to affect worker productivity both directly and indirectly, we show that the impact on wage earnings and employment can be summarized by two statistics. First, labor demand decreases in the average exposure of workers' tasks to AI technologies; second, holding the average exposure constant, labor demand increases in the dispersion of task exposures to AI, as workers shift effort to tasks that are not displaced by AI. Exploiting exogenous variation in our measures based on pre-existing hiring practices across firms, we find empirical support for these predictions, together with a lower demand for skills affected by AI. Overall, we find muted effects of AI on employment due to offsetting effects: highly-exposed occupations experience relatively lower demand compared to less exposed occupations, but the resulting increase in firm productivity increases overall employment across all occupations. |
JEL: | E20 J01 J24 O3 O33 |
Date: | 2025–02 |
URL: | https://d.repec.org/n?u=RePEc:nbr:nberwo:33509 |
By: | Esteban Rossi-Hansberg; Jialing Zhang |
Abstract: | We use high-resolution spatial data to build a novel global annual gridded GDP dataset at 1°, 0.5°, and 0.25° resolutions from 2012 onward. Our random forest model trained on local and national GDP achieves an R² above 0.92 for GDP levels and above 0.62 for annual changes in regions left out of the training sample. By incorporating diverse indicators beyond population and nighttime lights, our estimates offer more precise subnational GDP measurements for analyzing economic shocks, local policies, and regional disparities. We evaluate the precision of our estimates with a sample case of COVID-19’s impact on local GDP in China. |
JEL: | E0 F0 R0 |
Date: | 2025–02 |
URL: | https://d.repec.org/n?u=RePEc:nbr:nberwo:33458 |
By: | Alhorr, Layane |
Abstract: | Social norms and childcare responsibilities often constrain women's mobility and work. This paper investigates the promise of digitalization in unlocking the growth of home-based businesses, an economic lifeline for women in developing countries. To do so, Jordanian female entrepreneurs were offered access to virtual storefronts through Facebook business pages, as well as access to an online digital marketing training created in collaboration with local social media influencers. After six months of hands-on support, treated women had higher business survival and weekly revenue and attracted more online clients. Machine learning heterogeneity analysis reveals that higher business performance and limitations on the owner's ability to leave her house at baseline are particularly predictive of effects. Together, results suggest that when constraints to technology adoption are lifted, digitalization can unlock windows of opportunity to talented female entrepreneurs, especially the mobility-constrained among them. |
Date: | 2024–06–13 |
URL: | https://d.repec.org/n?u=RePEc:wbk:wbrwps:10803 |
By: | von der Heyde, Leah (LMU Munich); Haensch, Anna-Carolina; Wenz, Alexander (University of Mannheim) |
Abstract: | [For the more comprehensive version of this work, please see https://arxiv.org/abs/2407.08563] The rise of large language models (LLMs) like GPT-3 has sparked interest in their potential for creating synthetic datasets, particularly in the realm of privacy research. This study critically evaluates the use of LLMs in generating synthetic public opinion data, pointing out the biases inherent in the data generation process. While LLMs, trained on vast internet datasets, can mimic societal attitudes and behaviors, their application in synthesizing data poses significant privacy and accuracy challenges. We investigate these issues using the case of vote choice prediction in the 2017 German federal elections. Employing GPT-3, we construct synthetic personas based on the German Longitudinal Election Study, prompting the LLM to predict voting behavior. Our analysis compares these LLM-generated predictions with actual survey data, focusing on the implications of using such synthetic data and the biases it may contain. The results demonstrate GPT-3’s propensity to inaccurately predict voter choices, with biases favoring certain political groups and more predictable voter profiles. This outcome raises critical questions about the reliability and ethical use of LLMs in generating synthetic data. |
Date: | 2023–12–01 |
URL: | https://d.repec.org/n?u=RePEc:osf:socarx:97r8s_v1 |
By: | Chase, Sarah K.; Sachdeva, Sonya; Wood, Spencer A (University of Washington); Lawler, Joshua J |
Abstract: | 1. Addressing social and ecological values is a central aim of democratic environmental management and policymaking, especially during deliberative and participatory processes. Agencies responsible for managing public lands would benefit from a deepened understanding of how various publics value those lands. 2. Federal land management agencies receive millions of written comments from the public on proposed management actions annually, providing a unique source of insights into how the public assigns value to public lands. To date, little attention has been directed towards methods for analyzing the public’s comments to understand their expressed values, in part because the volume of comments often makes manual analysis unworkable. 3. This study introduces and applies a novel computational approach to inferring values in written text by using natural language processing and a method that combines a lexicon with semantic embedding models. We developed embedding models for four types of values that are expressed in public comments. We then fit models to 409,241 public comments on actions proposed by the United States Forest Service from 2011 to 2020 and regulated by the National Environmental Policy Act. 4. The embedding model generally outperformed the lexicon word-count, particularly for value types with shorter lexicons, and, like human evaluators, the embedding models performed better for more evident values and were less reliable for more abstract or latent values. 5. By applying the resulting model, we furthered our understanding of how the public values National Forest lands in the United States. We observed that aesthetic and moral values were expressed more often in comments for projects that received more public interest, as gauged by the number of comments a project received, and in comments for projects addressing recreational management. |
Date: | 2025–02–21 |
URL: | https://d.repec.org/n?u=RePEc:osf:socarx:f4pgy_v1 |
By: | Stacy, Brian William; Kitzmüller, Lucas; Wang, Xiaoyu; Mahler, Daniel Gerszon; Serajuddin, Umar |
Abstract: | Data-driven research on a country is key to producing evidence-based public policies. Yet little is known about where data-driven research is lacking and how it could be expanded. This paper proposes a method for tracking academic data use by country of subject, applying natural language processing to open-access research papers. The model’s predictions produce country estimates of the number of articles using data that are highly correlated with a human-coded approach, with a correlation of 0.99. Analyzing more than 1 million academic articles, the paper finds that the number of articles on a country is strongly correlated with its gross domestic product per capita, population, and the quality of its national statistical system. The paper identifies data sources that are strongly associated with data-driven research and finds that availability of subnational data appears to be particularly important. Finally, the paper classifies countries into groups based on whether they could most benefit from increasing their supply of or demand for data. The findings show that the former applies to many low- and lower-middle-income countries, while the latter applies to many upper-middle- and high-income countries. |
Date: | 2024–01–17 |
URL: | https://d.repec.org/n?u=RePEc:wbk:wbrwps:10673 |
By: | Adewopo, Julius Babatunde; Andree, Bo Pieter Johannes; Peter, Helen; Solano-Hermosilla, Gloria; Micale, Fabio |
Abstract: | High-frequency monitoring of food commodity prices is important for assessing and responding to shocks, especially in fragile contexts where timely and targeted interventions for food security are critical. However, national price surveys are typically limited in temporal and spatial granularity. It is cost-prohibitive to implement traditional data collection at frequent timescales to unravel spatiotemporal price evolution across market segments and at subnational geographic levels. Recent advancements in data innovation offer promising solutions to address the paucity of commodity price data and guide market intelligence for diverse development stakeholders. The use of artificial intelligence to estimate missing price data and a parallel effort to crowdsource commodity price data are both unlocking cost-effective opportunities to generate actionable price data. Yet, little is known about how the data from these alternative methods relate to independent ground truth data. To evaluate if these data strategies can meet the long-standing demand for real-time intelligence on food affordability, this paper analyzes open-source daily crowdsourced data (104,931 datapoints) from a recently published data set in Nature Journal, relative to a complementary ground truth sample. The paper subsequently compares these data to open-source monthly artificial intelligence–generated price data for identical commodities over a 36-month period in northern Nigeria, from 2019 to 2022. The results show that all the data sources share a high degree of comparability, with variation across commodity and market segments. Overall, the findings provide important support for leveraging these new and innovative data approaches to enable data-driven decision-making in near real time. |
Date: | 2024–04–23 |
URL: | https://d.repec.org/n?u=RePEc:wbk:wbrwps:10758 |