nep-big New Economics Papers
on Big Data
Issue of 2022‒07‒25
24 papers chosen by
Tom Coupé
University of Canterbury

  1. The Data Economy: Market Size and Global Trade By Diane Coyle; Wendy Li
  2. Developing experimental estimates of regional skill demand By Stef Garasto; Jyldyz Djumalieva; Karlis Kanders; Rachel Wilcock; Cath Sleeman
  3. By Diana Gabrielyan; Lenno Uusküla
  4. Human Wellbeing and Machine Learning By Kaiser, Caspar; Oparina, Ekaterina; Gentile, Niccolò; Tkatchenko, Alexandre; Clark, Andrew E.; De Neve, Jan-Emmanuel; D’Ambrosio, Conchita
  5. Road Quality and Mean Speed Score By Marian Moszoro; Mauricio Soto
  6. A Random Forest approach of the Evolution of Inequality of Opportunity in Mexico By Thibaut Plassot; Isidro Soloaga; Pedro J. Torres L.
  7. Communicating Data Uncertainty: Multi-Wave Experimental Evidence for U.K. GDP By Ana Galvao; James Mitchell
  8. Systematizing Macroframework Forecasting: High-Dimensional Conditional Forecasting with Accounting Identities By Mr. Sakai Ando; Mr. Taehoon Kim
  9. Textual analysis of a Twitter corpus during the COVID-19 pandemics By Valerio Astuti; Marta Crispino; Marco Langiulli; Juri Marcucci
  10. Nothing Propinks Like Propinquity: Using Machine Learning to Estimate the Effects of Spatial Proximity in the Major League Baseball Draft By Majid Ahmadi; Nathan Durst; Jeff Lachman; John List; Mason List; Noah List; Atom Vayalinkal
  11. A multimodal model with Twitter FinBERT embeddings for extreme price movement prediction of Bitcoin By Yanzhao Zou; Dorien Herremans
  12. Reported Social Unrest Index: March 2022 Update By Mr. Philip Barrett
  13. Economic Implications of the Use of Personal Information: Potential Impact of the Digital Platform Companies on Payment Services By Sei Sugino; Yuji Maruo
  14. A novel approach to rating transition modelling via Machine Learning and SDEs on Lie groups By Kevin Kamm; Michelle Muniz
  15. Overcoming Data Sparsity: A Machine Learning Approach to Track the Real-Time Impact of COVID-19 in Sub-Saharan Africa By Karim Barhoumi; Jiaxiong Yao; Tara Iyer; Seung Mo Choi; Jiakun Li; Franck Ouattara; Mr. Andrew J Tiffin
  16. Reinforcement Learning in Macroeconomic Policy Design: A New Frontier? By Callum Tilbury
  17. Measuring Quarterly Economic Growth from Outer Space By Robert C. M. Beyer; Jiaxiong Yao; Yingyao Hu
  18. Machine Learning Can Predict Shooting Victimization Well Enough to Help Prevent It By Sara B. Heller; Benjamin Jakubowski; Zubin Jelveh; Max Kapustin
  19. Measuring the Tolerance of the State: Theory and Application to Protest By Veli Andirin; Yusuf Neggers; Mehdi Shadmehr; Jesse M. Shapiro
  20. (Machine) Learning What Policies Value By Daniel Björkegren; Joshua E. Blumenstock; Samsun Knight
  21. What We Talk about When We Talk about Self-employment: Examining Self-employment and the Transition to Retirement among Older Adults in the United States By Joelle Abramowitz
  22. Finite-Sample Guarantees for High-Dimensional DML By Victor Quintas-Martinez
  23. An interpretable machine learning workflow with an application to economic forecasting By Buckmann, Marcus; Joseph, Andreas
  24. Predicting Political Ideology from Digital Footprints By Michael Kitchener; Nandini Anantharama; Simon D. Angus; Paul A. Raschky

  1. By: Diane Coyle; Wendy Li
    Abstract: Data is a key digital economy input and its use is growing rapidly. Large online platforms using data at massive scale operate globally. The data gap between them and the incumbents they disrupt, a barrier to entry in the markets they dominate, affects not only firms but also aggregate innovation, investment and trade. Valuing data is problematic, yet this information is crucial for informed policy decisions on infrastructure and human capital as well as business investment decisions. In this paper we demonstrate a novel sectoral methodology for estimating the economic value of markets for data. Our conservative estimate of the market size for data in the global hospitality industry was US $43.2 billion in 2018, and it has been doubling in size every three years. Our method can provide industry-level and country-level information on data markets. The scale of data flows affects the international division of labor in the digital economy, with important policy implications. With many jurisdictions introducing different data protection and trade regimes, affecting the data gap and data access by market participants, we present a trade typology of countries and discuss their ability to benefit from data value creation.
    Keywords: data, digital, innovation, trade
    JEL: F1 F2 O3
    Date: 2021–08
  2. By: Stef Garasto; Jyldyz Djumalieva; Karlis Kanders; Rachel Wilcock; Cath Sleeman
    Abstract: This paper shows how novel data, in the form of online job adverts, can be used to enrich official labour market statistics. We use millions of job adverts to provide granular estimates of the vacancy stock broken down by location, occupation and skill category. To derive these estimates, we build on previous work and deploy methodologies for a) converting the flow of job adverts into a stock and b) adjusting this stock to ensure it is representative of the underlying economy. Our results benefit from the use of duration data at the level of individual vacancies. We also introduce a new iteration of Nesta’s skills taxonomy. This is the first iteration to blend an expert-derived collection of skills with the skills extracted from job adverts. These methodological advances allow us to analyse which skill sets are sought by employers, how these vary across Travel To Work Areas in the UK and how skill demand evolves over time. For example, we find that there is considerable geographical variability in skill demand, with the stock varying more than five-fold across locations. At the same time, most of the demand is concentrated among three categories: "Business, law and finance", "Science, manufacturing and engineering" and "Digital". Together, these account for more than 60 per cent of all skills demanded. The type of intelligence presented in this report could be used to support both local and national decision makers in responding to recent labour market disruptions.
    Keywords: big data, labour demand, machine learning, online job adverts, skills, word embeddings
    JEL: C18 J23 J24
    Date: 2021–03
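The flow-to-stock conversion described in the abstract can be sketched as follows: an advert contributes to the stock on a given day if it was posted on or before that day and its observed duration has not yet elapsed. The advert records and skill labels below are invented for illustration and are not Nesta's actual schema.

```python
from datetime import date, timedelta

# Hypothetical adverts: (posting date, duration in days, skill category).
adverts = [
    (date(2021, 1, 4), 28, "Digital"),
    (date(2021, 1, 18), 14, "Digital"),
    (date(2021, 1, 25), 35, "Business, law and finance"),
    (date(2021, 2, 1), 21, "Science, manufacturing and engineering"),
]

def vacancy_stock(adverts, on_day):
    """Count adverts still live on `on_day`: posted on or before it,
    and not yet expired given their observed duration."""
    stock = {}
    for posted, duration, skill in adverts:
        if posted <= on_day < posted + timedelta(days=duration):
            stock[skill] = stock.get(skill, 0) + 1
    return stock

print(vacancy_stock(adverts, date(2021, 2, 1)))
```

A representativeness adjustment (step b in the abstract) would then reweight these counts against a benchmark vacancy survey, which is beyond this sketch.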
  3. By: Diana Gabrielyan; Lenno Uusküla
    Abstract: We extract measures of inflation expectations from online news to build real interest rates that capture true consumer expectations. The new measure is incorporated into various Euler consumption models. While benchmark models based on traditional risk-free rates fail, models built with the novel news-driven inflation expectations indices improve upon the benchmarks and yield strong instruments. Our positive findings highlight the role the media play in consumer expectation formation and support the use of such novel data sources for other key macroeconomic relationships.
    Keywords: Euler equation, expectations, media, machine learning
    Date: 2022
  4. By: Kaiser, Caspar; Oparina, Ekaterina; Gentile, Niccolò; Tkatchenko, Alexandre; Clark, Andrew E.; De Neve, Jan-Emmanuel; D’Ambrosio, Conchita
    Abstract: There is a vast literature on the determinants of subjective wellbeing. Yet, standard regression models explain little variation in wellbeing. We here use data from Germany, the UK, and the US to assess the potential of Machine Learning (ML) to help us better understand wellbeing. Compared to traditional models, ML approaches provide moderate improvements in predictive performance. Drastically expanding the set of explanatory variables doubles our predictive ability across approaches on unseen data. The variables identified as important by ML – material conditions, health, social relations – are similar to those previously identified. Our data-driven ML results therefore validate previous conventional approaches.
    Date: 2022–06
  5. By: Marian Moszoro; Mauricio Soto
    Abstract: We introduce a novel measure of cross-country road quality based on the travel mean speed between large cities from Google Maps. This measure is useful to assess road infrastructure and access gaps. Our Mean Speed (MS) score is easier to estimate and update than traditional gauges of road network quality, which rely on official reports, surveys (e.g., the World Economic Forum’s Quality of Roads Perception survey), or satellite imaging (e.g., the World Bank’s Rural Access Index). In a sample of over 160 countries, we find that MS scores range between 38 km/h (23.6 mph) and 107 km/h (66.5 mph). We show that the MS score is a strong proxy for road quality and access.
    Keywords: Road Quality; Sustainable Development Goals; Access to Infrastructure; MS score; road infrastructure; geometric mean speed score; Infrastructure; Income; Africa; Global
    Date: 2022–05–20
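A mean-speed score of the kind described can be sketched from city-pair distances and travel times; the routes below are invented, not actual Google Maps queries. Both an aggregate (distance-weighted) and a geometric mean variant are shown, since the keywords mention a geometric mean speed score.

```python
from math import prod

# Hypothetical city-pair routes as (distance in km, travel time in hours).
routes = [(210.0, 2.5), (480.0, 5.0), (95.0, 1.9)]

def mean_speed_score(routes):
    """Aggregate mean speed: total distance over total travel time (km/h)."""
    total_km = sum(d for d, _ in routes)
    total_h = sum(t for _, t in routes)
    return total_km / total_h

def geometric_mean_speed(routes):
    """Geometric mean of per-route speeds, which damps the influence of a
    few fast motorway corridors on the country score."""
    speeds = [d / t for d, t in routes]
    return prod(speeds) ** (1 / len(speeds))

print(round(mean_speed_score(routes), 1), round(geometric_mean_speed(routes), 1))
```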
  6. By: Thibaut Plassot (Department of Economics - Universidad Iberoamericana Mexico City); Isidro Soloaga (Department of Economics - Universidad Iberoamericana Mexico City); Pedro J. Torres L. (Department of Economics - Universidad Iberoamericana Mexico City)
    Abstract: This work presents the trend of Inequality of Opportunity (IOp) and total inequality in wealth in Mexico for the years 2006, 2011 and 2017, and provides estimations using both ex-ante and ex-post compensation criteria. We rely on a data-driven approach, using supervised machine learning models to run regression trees and random forests that consider individuals' circumstances and effort. We find an intensification of both total inequality and IOp between 2006 and 2011, as well as a reduction of both between 2011 and 2017, with absolute IOp slightly higher in 2017 than in 2006. From an ex-ante perspective, the share of IOp within total inequality slightly decreased, whereas from an ex-post perspective the share remains stable across time. The most important variable in determining IOp is household wealth at age 14, followed by both father's and mother's education. Other variables, such as the parents' ability to speak an indigenous language, proved to have a lower impact over time.
    JEL: C14 C81 D31 D63
    Date: 2022–06–30
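The ex-ante logic of such estimates can be sketched in a few lines: replace each outcome by the mean of its circumstance "type" and measure inequality in that smoothed distribution. A random forest would partition circumstances into types automatically; here the types and wealth figures are fixed by hand and entirely invented.

```python
from statistics import mean

# Hypothetical individuals: (circumstance type, wealth index).
people = [
    ("low-parental-wealth", 2.0), ("low-parental-wealth", 4.0),
    ("mid-parental-wealth", 5.0), ("mid-parental-wealth", 7.0),
    ("high-parental-wealth", 9.0), ("high-parental-wealth", 13.0),
]

def gini(xs):
    """Gini coefficient of a list of non-negative outcomes."""
    xs = sorted(xs)
    n = len(xs)
    cum = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * cum) / (n * sum(xs)) - (n + 1) / n

# Ex-ante IOp: inequality in the circumstance-driven (type-mean) distribution.
by_type = {}
for t, w in people:
    by_type.setdefault(t, []).append(w)
smoothed = [mean(by_type[t]) for t, _ in people]

total_ineq = gini([w for _, w in people])
iop = gini(smoothed)
print(f"total Gini {total_ineq:.3f}, ex-ante IOp {iop:.3f}, share {iop / total_ineq:.2f}")
```

The IOp share is the ratio of the smoothed Gini to the total Gini; an ex-post estimate would instead compare individuals at the same effort level across types.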
  7. By: Ana Galvao; James Mitchell
    Abstract: Economic statistics are commonly published without any explicit indication of their uncertainty. To assess whether and how the UK public interpret and understand data uncertainty, we conduct two waves of a randomized controlled online experiment. A control group is presented with the headline point estimate of GDP, as emphasized by the statistical office. Treatment groups are then presented with alternative qualitative and quantitative communications of GDP data uncertainty. We find that most of the public understand there is uncertainty inherent in official GDP numbers. But communicating uncertainty information improves understanding. It encourages the public not to take estimates at face value, but does not decrease trust in the data. Quantitative tools to communicate data uncertainty – notably intervals, density strips and bell curves – are especially beneficial. They reduce dispersion of the public’s subjective probabilistic expectations of data uncertainty, improving alignment with objective estimates.
    Keywords: data revisions, data uncertainty, uncertainty communication
    JEL: C82 D80 E01
    Date: 2021–06
  8. By: Mr. Sakai Ando; Mr. Taehoon Kim
    Abstract: Forecasting a macroframework, which consists of many macroeconomic variables and accounting identities, is widely conducted in the policy arena to present an economic narrative and check its consistency. Such forecasting is challenging, however, because forecasters must extend limited information to the entire macroframework in an internally consistent manner. This paper proposes a method to systematically forecast a macroframework by integrating (1) conditional forecasting with machine-learning techniques and (2) forecast reconciliation of hierarchical time series. We apply our method to an advanced economy and a tourism-dependent economy, using France and Seychelles, and show that it can improve upon the World Economic Outlook (WEO) forecast.
    Keywords: Macroframework; Conditional Forecasting; Reconciliation; Accounting Identities; Hierarchical Time Series
    Date: 2022–06–03
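The reconciliation step can be illustrated with the simplest accounting identity, GDP = C + I + G + NX: project the stacked base forecasts onto the hyperplane where the identity holds, which for an equal-weighted least-squares projection splits the discrepancy evenly across all series. The forecast numbers are invented; the authors' actual method handles many identities and hierarchical time series jointly.

```python
def reconcile(gdp, components):
    """Minimal least-squares adjustment so that gdp == sum(components).
    Orthogonal projection onto the accounting-identity hyperplane: the
    identity violation d is split equally across all n series."""
    d = gdp - sum(components)          # identity violation
    n = len(components) + 1
    gdp_adj = gdp - d / n
    comps_adj = [c + d / n for c in components]
    return gdp_adj, comps_adj

# Hypothetical base forecasts (growth contributions): C, I, G, NX and GDP.
gdp_star, comps_star = reconcile(2.5, [1.2, 0.6, 0.4, 0.0])
print(gdp_star, comps_star)
```

After reconciliation the identity holds exactly, while each series moves as little as possible from its base forecast.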
  9. By: Valerio Astuti (Bank of Italy); Marta Crispino (Bank of Italy); Marco Langiulli (Bank of Italy); Juri Marcucci (Bank of Italy)
    Abstract: Text data gathered from social media are extremely up-to-date and have a great potential value for economic research. At the same time, they pose some challenges, as they require different statistical methods from the ones used for traditional data. The aim of this paper is to give a critical overview of three of the most common techniques used to extract information from text data: topic modelling, word embedding and sentiment analysis. We apply these methodologies to data collected from Twitter during the COVID-19 pandemic to investigate the influence the pandemic had on the Italian Twitter community and to discover the topics most actively discussed on the platform. Using these techniques of automated textual analysis, we are able to make inferences about the most important subjects covered over time and build real-time daily indicators of the sentiment expressed on this platform.
    Keywords: text as data, Twitter, big data, sentiment, Covid-19, topic analysis, word embedding
    JEL: C55 C14 C81 L82
    Date: 2022–06
  10. By: Majid Ahmadi; Nathan Durst; Jeff Lachman; John List; Mason List; Noah List; Atom Vayalinkal
    Abstract: Recent models and empirical work on network formation emphasize the importance of propinquity in producing strong interpersonal connections. Yet, one might wonder how deep such insights run, as thus far empirical results rely on survey and lab-based evidence. In this study, we examine propinquity in a high-stakes setting of talent allocation: the Major League Baseball (MLB) Draft. We examine draft picks from 2000-2019 across every MLB club of the nearly 30,000 players drafted (from a player pool of more than a million potential draftees). Our findings can be summarized in three parts. First, propinquity is alive and well in our setting, and spans even the latter years of our sample, when higher-level statistical exercises have become the norm rather than the exception. Second, the measured effect size is important, as MLB clubs pay a real cost in terms of inferior talent acquired due to propinquity bias: for example, their draft picks appear in 25 fewer games relative to teams that do not exhibit propinquity bias. Finally, the effect is found to be the most pronounced in later rounds of the draft (after round 15), where the Scouting Director has the greatest latitude.
    Date: 2022
  11. By: Yanzhao Zou; Dorien Herremans
    Abstract: Bitcoin, with its ever-growing popularity, has demonstrated extreme price volatility since its origin. This volatility, together with its decentralised nature, makes Bitcoin highly susceptible to speculative trading compared to more traditional assets. In this paper, we propose a multimodal model for predicting extreme price fluctuations. This model takes as input a variety of correlated assets, technical indicators, as well as Twitter content. In an in-depth study, we explore whether social media discussions from the general public on Bitcoin have predictive power for extreme price movements. A dataset of 5,000 tweets per day containing the keyword `Bitcoin' was collected from 2015 to 2021. This dataset, called PreBit, is made available online. In our hybrid model, we use sentence-level FinBERT embeddings, pretrained on financial lexicons, to capture the full content of the tweets and feed it to the model in an understandable way. By combining these embeddings with a Convolutional Neural Network, we built a predictive model for significant market movements. The final multimodal ensemble model includes this NLP model together with a model based on candlestick data, technical indicators and correlated asset prices. In an ablation study, we explore the contribution of the individual modalities. Finally, we propose and backtest a trading strategy based on the predictions of our models with varying prediction thresholds and show that it can be used to build a profitable trading strategy with a reduced risk over a `hold' or moving average strategy.
    Date: 2022–05
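The thresholded trading rule described at the end of the abstract can be sketched as follows: average the extreme-move probabilities from the two modalities and move to cash whenever the ensemble predicts an extreme movement. All probabilities and returns below are invented; the authors' models produce these inputs from FinBERT embeddings and candlestick data.

```python
# Hypothetical daily extreme-move probabilities from a text model and a
# price model, plus realised daily returns.
p_text  = [0.2, 0.7, 0.1, 0.8, 0.3]
p_price = [0.3, 0.6, 0.2, 0.9, 0.2]
returns = [0.01, -0.06, 0.02, -0.09, 0.015]

def backtest(p1, p2, rets, threshold=0.5):
    """Ensemble rule: average the two modality probabilities and hold cash
    (zero return) on days when an extreme move is predicted."""
    wealth = 1.0
    for a, b, r in zip(p1, p2, rets):
        if (a + b) / 2 < threshold:   # no extreme move predicted: stay long
            wealth *= 1 + r
    return wealth

hold = 1.0
for r in returns:
    hold *= 1 + r                     # buy-and-hold benchmark
print(round(backtest(p_text, p_price, returns), 4), round(hold, 4))
```

Varying `threshold` traces out the risk/return trade-off the paper backtests against hold and moving-average benchmarks.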
  12. By: Mr. Philip Barrett
    Abstract: This paper updates the Reported Social Unrest Index of Barrett et al (2020), reviewing recent developments in social unrest worldwide since the start of the COVID-19 Pandemic. It shows that unrest was elevated during late 2019, coincident with widespread protests in Latin America. Unrest then fell markedly during the early stages of the pandemic as voluntary and involuntary social distancing struck. Social unrest has since returned but generally remains below levels seen in 2019.
    Keywords: Social unrest; Media coverage; unrest vulnerability; machine learning; RSUI-identified events; COVID-19; Caribbean; Global
    Date: 2022–05–06
  13. By: Sei Sugino (Bank of Japan); Yuji Maruo (Bank of Japan)
    Abstract: Personal information is actively collected, processed, and transmitted against the backdrop of the digitalization of people's social activities and the rapid improvement of data analysis methods, including artificial intelligence and machine learning. In order for individuals and society to benefit from the effective use of personal information, it is important that a broad range of entities have proper opportunities to use personal information with due consideration to privacy protection. To achieve this goal, self-management of private information is crucial, but not sufficient. Considerable inefficiencies associated with "market failure" and "negative externalities" might remain. This article provides an overview of the international discussion among academics and policy makers regarding these issues. In particular, we illustrate the mechanism that might generate monopoly power for digital platform companies, and the potential implications of public digital payment instruments for eliminating inefficiencies arising from a less competitive market.
    Keywords: digital payments; privacy; data intermediaries; informational externalities
    Date: 2022–06–23
  14. By: Kevin Kamm; Michelle Muniz
    Abstract: In this paper, we introduce a novel methodology to model rating transitions with a stochastic process. To construct stochastic processes whose values are valid rating matrices, we exploit the geometric properties of stochastic matrices and their link to matrix Lie groups. We give a gentle introduction to this topic and demonstrate how Itô SDEs in R will generate the desired model for rating transitions. To calibrate the rating model to historical data, we use a Deep Neural Network (DNN) called TimeGAN to learn the features of a time series of historical rating matrices. Then, we use this DNN to generate synthetic rating transition matrices. Afterwards, we fit the moments of the generated rating matrices and the rating process at specific time points, which results in a good fit. After calibration, we discuss the quality of the calibrated rating transition process by examining some properties that a time series of rating matrices should satisfy, and we see that this geometric approach works very well.
    Date: 2022–05
  15. By: Karim Barhoumi; Jiaxiong Yao; Tara Iyer; Seung Mo Choi; Jiakun Li; Franck Ouattara; Mr. Andrew J Tiffin
    Abstract: The COVID-19 crisis has had a tremendous economic impact on all countries. Yet, assessing the full impact of the crisis has frequently been hampered by the delayed publication of official GDP statistics in several emerging market and developing economies. This paper outlines a machine-learning framework that helps track economic activity in real time for these economies. As illustrative examples, the framework is applied to selected sub-Saharan African economies. The framework is able to provide timely information on economic activity more swiftly than official statistics.
    Keywords: Sub-Saharan Africa; Economic Activity; GDP; Machine Learning; Nowcasting; COVID-19; data sparsity; Oil prices; Real effective exchange rates; Africa; Global
    Date: 2022–05–06
  16. By: Callum Tilbury
    Abstract: Agent-based computational macroeconomics is a field with a rich academic history, yet one which has struggled to enter mainstream policy design toolboxes, plagued by the challenges associated with representing a complex and dynamic reality. The field of Reinforcement Learning (RL), too, has a rich history, and has recently been at the centre of several exponential developments. Modern RL implementations have been able to achieve unprecedented levels of sophistication, handling previously-unthinkable degrees of complexity. This review surveys the historical barriers of classical agent-based techniques in macroeconomic modelling, and contemplates whether recent developments in RL can overcome any of them.
    Date: 2022–06
  17. By: Robert C. M. Beyer; Jiaxiong Yao; Yingyao Hu
    Abstract: This paper presents a novel framework to estimate the elasticity between nighttime lights and quarterly economic activity. The relationship is identified by accounting for varying degrees of measurement errors in nighttime light data across countries. The estimated elasticity is 1.55 for emerging markets and developing economies, ranging from 1.36 to 1.81 across country groups and robust to different model specifications. The paper uses a light-adjusted measure of quarterly economic activity to show that higher levels of development, statistical capacity, and voice and accountability are associated with more precise national accounts data. The elasticity allows quantification of subnational economic impacts. During the COVID-19 pandemic, regions with higher levels of development and population density experienced larger declines in economic activity.
    Keywords: Nighttime lights; economic measurement; quarterly GDP; national accounts; COVID-19.
    Date: 2022–06–03
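The core elasticity in such exercises is the slope of a log-log regression of economic activity on nighttime-light intensity. The sketch below fits that slope by ordinary least squares on invented quarterly data; the paper's actual estimation pools many countries and explicitly corrects for measurement error in the lights data, which this ignores.

```python
from math import log

# Hypothetical quarterly observations: nighttime-light radiance and a GDP index.
lights = [100.0, 104.0, 98.0, 110.0, 115.0]
gdp    = [200.0, 205.5, 197.5, 213.0, 219.5]

def elasticity(x, y):
    """OLS slope of log(y) on log(x): percent change in y per
    percent change in x."""
    lx, ly = [log(v) for v in x], [log(v) for v in y]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den

print(round(elasticity(lights, gdp), 2))
```

With the estimated elasticity in hand, observed light changes for a subnational region can be converted into an implied change in economic activity, which is how the paper quantifies regional COVID-19 impacts.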
  18. By: Sara B. Heller; Benjamin Jakubowski; Zubin Jelveh; Max Kapustin
    Abstract: This paper shows that shootings are predictable enough to be preventable. Using arrest and victimization records for almost 644,000 people from the Chicago Police Department, we train a machine learning model to predict the risk of being shot in the next 18 months. We address central concerns about police data and algorithmic bias by predicting shooting victimization rather than arrest, which we show accurately captures risk differences across demographic groups despite bias in the predictors. Out-of-sample accuracy is strikingly high: of the 500 people with the highest predicted risk, 13 percent are shot within 18 months, a rate 130 times higher than the average Chicagoan. Although Black male victims more often have enough police contact to generate predictions, those predictions are not, on average, inflated; the demographic composition of predicted and actual shooting victims is almost identical. There are legal, ethical, and practical barriers to using these predictions to target law enforcement. But using them to target social services could have enormous preventive benefits: predictive accuracy among the top 500 people justifies spending up to $123,500 per person for an intervention that could cut their risk of being shot in half.
    JEL: C53 H75 I14 K42
    Date: 2022–06
  19. By: Veli Andirin; Yusuf Neggers; Mehdi Shadmehr; Jesse M. Shapiro
    Abstract: We develop a measure of a regime's tolerance for an action by its citizens. We ground our measure in an economic model and apply it to the setting of political protest. In the model, a regime anticipating a protest can take a costly action to repress it. We define the regime's tolerance as the ratio of its cost of repression to its cost of protest. Because an intolerant regime will engage in repression whenever protest is sufficiently likely, a regime's tolerance determines the maximum equilibrium probability of protest. Tolerance can therefore be identified from the distribution of protest probabilities. We construct a novel cross-national database of protest occurrence and protest predictors, and apply machine-learning methods to estimate protest probabilities. We use the estimated protest probabilities to form a measure of tolerance at the country, country-year, and country-month levels. We apply the measure to questions of interest.
    JEL: C55 D74
    Date: 2022–06
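The identification idea can be sketched directly: because an intolerant regime represses whenever protest is sufficiently likely, tolerance caps the equilibrium protest probability, so a high quantile of estimated probabilities proxies for it. The country names and probability series below are invented, standing in for the machine-learning estimates the paper constructs.

```python
# Hypothetical monthly protest probabilities per country, as estimated by a
# machine-learning classifier on protest predictors.
probs = {
    "Country A": [0.02, 0.05, 0.40, 0.33, 0.21, 0.38],
    "Country B": [0.01, 0.03, 0.04, 0.06, 0.05, 0.02],
}

def tolerance_proxy(p, q=0.9):
    """Upper quantile of estimated protest probabilities. The maximum
    equilibrium probability identifies tolerance in the model; a high
    quantile approximates it with some robustness to estimation noise."""
    s = sorted(p)
    k = min(len(s) - 1, int(q * len(s)))
    return s[k]

for country, p in probs.items():
    print(country, tolerance_proxy(p))
```

Here Country A's protest probabilities reach much higher levels than Country B's, so the proxy ranks A as the more tolerant regime.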
  20. By: Daniel Bj\"orkegren; Joshua E. Blumenstock; Samsun Knight
    Abstract: When a policy prioritizes one person over another, is it because they benefit more, or because they are preferred? This paper develops a method to uncover the values consistent with observed allocation decisions. We use machine learning methods to estimate how much each individual benefits from an intervention, and then reconcile its allocation with (i) the welfare weights assigned to different people; (ii) heterogeneous treatment effects of the intervention; and (iii) weights on different outcomes. We demonstrate this approach by analyzing Mexico's PROGRESA anti-poverty program. The analysis reveals that while the program prioritized certain subgroups -- such as indigenous households -- the fact that those groups benefited more implies that they were actually assigned a lower welfare weight. The PROGRESA case illustrates how the method makes it possible to audit existing policies, and to design future policies that better align with values.
    Date: 2022–06
  21. By: Joelle Abramowitz (University of Michigan)
    Abstract: The fraction of workers who are self-employed increases with age, but the types of self-employment that older workers do and the effects of this work on their well-being are not well understood. This project examines such heterogeneity by considering how differing investment and managerial responsibilities in self-employment contribute to disparities in characteristics and measures of economic, physical, and mental well-being. The paper first uses internal narrative descriptions of industry and occupation in the 1994 to 2018 Health and Retirement Study and machine learning methods to classify self-employment reports into a useful framework of self-employment roles. The project then uses these roles to examine self-employment heterogeneity and finds substantial differences in demographic characteristics, work characteristics, income, benefits, quality of life, and retirement expectations across self-employment roles. Further work finds distinctive patterns in role changes with the transition to retirement such that large shares of workers in all roles transition into independent self-employment at the time of retirement. Work linking to administrative records suggests substantial discrepancies, which vary across roles, between survey responses and administrative records and finds the most prominent discrepancies for post-retirement independent self-employment. The paper's findings motivate future research exploring the work trajectories leading to these roles and their consequences on financial, physical, and mental well-being into retirement.
    Date: 2021–09
  22. By: Victor Quintas-Martinez
    Abstract: Debiased machine learning (DML) offers an attractive way to estimate treatment effects in observational settings, where identification of causal parameters requires a conditional independence or unconfoundedness assumption, since it allows researchers to control flexibly for a potentially very large number of covariates. This paper gives novel finite-sample guarantees for joint inference on high-dimensional DML, bounding how far the finite-sample distribution of the estimator is from its asymptotic Gaussian approximation. These guarantees are useful to applied researchers, as they are informative about how far off the coverage of joint confidence bands can be from the nominal level. There are many settings where high-dimensional causal parameters may be of interest, such as the ATE of many treatment profiles, or the ATE of a treatment on many outcomes. We also cover infinite-dimensional parameters, such as impacts on the entire marginal distribution of potential outcomes. The finite-sample guarantees in this paper complement the existing results on consistency and asymptotic normality of DML estimators, which are either asymptotic or treat only the one-dimensional case.
    Date: 2022–06
  23. By: Buckmann, Marcus (Bank of England); Joseph, Andreas (Bank of England)
    Abstract: We propose a generic workflow for the use of machine learning models to inform decision making and to communicate modelling results with stakeholders. It involves three steps: (1) a comparative model evaluation, (2) a feature importance analysis and (3) statistical inference based on Shapley value decompositions. We discuss the different steps of the workflow in detail and demonstrate each by forecasting changes in US unemployment one year ahead using the well-established FRED-MD dataset. We find that universal function approximators from the machine learning literature, including gradient boosting and artificial neural networks, outperform more conventional linear models. This better performance is associated with greater flexibility, allowing the machine learning models to account for time-varying and nonlinear relationships in the data generating process. The Shapley value decomposition identifies economically meaningful nonlinearities learned by the models. Shapley regressions for statistical inference on machine learning models enable us to assess and communicate variable importance akin to conventional econometric approaches. While we also explore high-dimensional models, our findings suggest that the best trade-off between interpretability and performance of the models is achieved when a small set of variables is selected by domain experts.
    Keywords: machine learning; model interpretability; forecasting; unemployment; Shapley values
    JEL: C14 C38 C45 C52 C53 C71 E24
    Date: 2022–06–01
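The Shapley step of the workflow can be sketched exactly for a small model: a feature's attribution is its average marginal contribution over all coalitions of the other features, with absent features set to a baseline. The linear forecasting "model", coefficient names, and observation below are invented; with a linear model the exact Shapley value reduces to coefficient times deviation from baseline, which makes the general formula easy to check.

```python
from itertools import combinations
from math import factorial

# Hypothetical linear model and one observation.
coef = {"unemployment_lag": 0.6, "spread": -0.3, "sentiment": 0.2}
x = {"unemployment_lag": 1.5, "spread": 2.0, "sentiment": -1.0}
baseline = {k: 0.0 for k in coef}

def predict(inputs):
    return sum(coef[k] * inputs[k] for k in coef)

def shapley(feature):
    """Average marginal contribution of `feature` over all coalitions,
    filling absent features with their baseline value."""
    others = [k for k in coef if k != feature]
    n, total = len(coef), 0.0
    for r in range(len(others) + 1):
        for S in combinations(others, r):
            with_s = {k: (x[k] if k in S else baseline[k]) for k in coef}
            with_sf = dict(with_s, **{feature: x[feature]})
            weight = factorial(r) * factorial(n - r - 1) / factorial(n)
            total += weight * (predict(with_sf) - predict(with_s))
    return total

phi = {k: shapley(k) for k in coef}
print(phi)
```

The attributions sum exactly to the difference between the prediction and the baseline prediction (the efficiency property), which is what makes Shapley regressions usable for inference in the way the paper describes.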
  24. By: Michael Kitchener; Nandini Anantharama; Simon D. Angus; Paul A. Raschky
    Abstract: This paper proposes a new method to predict individual political ideology from digital footprints on one of the world's largest online discussion forums. We compiled a unique data set from the online discussion forum reddit that contains information on the political ideology of around 91,000 users as well as records of their comment frequency and the comments' text corpus in over 190,000 different subforums of interest. Applying a set of statistical learning approaches, we show that information about activity in non-political discussion forums alone can predict a user's political ideology very accurately. Depending on the model, we are able to predict the economic dimension of ideology with an accuracy of up to 90.63% and the social dimension with an accuracy of up to 82.02%. In comparison, using the textual features from actual comments does not improve predictive accuracy. Our paper highlights the importance of revealed digital behaviour in complementing stated preferences from digital communication when analysing human preferences and behaviour using online data.
    Date: 2022–06

This nep-big issue is ©2022 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at For comments please write to the director of NEP, Marco Novarese at <>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.