nep-big 2021-03-01 papers

on Big Data

Issue of 2021‒03‒01
eighteen papers chosen by
Tom Coupé
University of Canterbury

Can we measure inflation expectations using Twitter? By Cristina Angelico; Juri Marcucci; Marcello Miccoli; Filippo Quarta
The Gender Pay Gap Revisited with Big Data: Do Methodological Choices Matter? By Strittmatter, Anthony; Wunsch, Conny
Predicting poverty and malnutrition for targeting, mapping, monitoring, and early warning By McBride, Linden; Barrett, Christopher B.; Browne, Christopher; Hu, Leiqiu; Liu, Yanyan; Matteson, David S.; Sun, Ying; Wen, Jiaming
A learning scheme by sparse grids and Picard approximations for semilinear parabolic PDEs By Jean-François Chassagneux; Junchao Chen; Noufel Frikha; Chao Zhou
Supporting Financial Inclusion with Graph Machine Learning and Super-App Alternative Data By Luisa Roa; Andrés Rodríguez-Rey; Alejandro Correa-Bahnsen; Carlos Valencia
Modelos de Aprendizaje Automático Mediante Árboles de Decisión By Carlos Arana
Firm-Level Risk Exposures and Stock Returns in the Wake of COVID-19 By Steven J. Davis; Stephen Hansen; Cristhian Seminario-Amez
Computing Consumer Sentiment in Germany via Social Media Data By Karaman Örsal, Deniz Dilan; Sturm, Silke
Comparing Conventional and Machine-Learning Approaches to Risk Assessment in Domestic Abuse Cases By Jeffrey Grogger; Sean Gupta; Ria Ivandic; Tom Kirchmaier
Open Banking: Credit Market Competition When Borrowers Own the Data By Zhiguo He; Jing Huang; Jidong Zhou
Deep Video Prediction for Time Series Forecasting By Zhen Zeng; Tucker Balch; Manuela Veloso
Politician-Citizen Interactions and Dynamic Representation: Evidence from Twitter By Aina Gallego; Gaël Le Mens; Nikolas Schöll
Politician-citizen interactions and dynamic representation: Evidence from Twitter By Aina Gallego; Nikolas Schöll; Gaël Le Mens
CEO behavior and firm performance By Bandiera, Oriana; Prat, Andrea; Hansen, Stephen; Sadun, Raffaella
"For the times they are a-changin": Gauging Uncertainty Perception over Time By Müller, Henrik; Hornig, Nico; Rieger, Jonas
Artificial Intelligence, Teacher Tasks and Individualized Pedagogy By Ferman, Bruno; Lima, Lycia; Riva, Flávio
Governance of Data Sharing : a Law & Economics Proposal By Graef, Inge; Prüfer, Jens
Données de santé : l'arbre StopCovid qui cache la forêt Health Data Hub By Bernard Fallery

Can we measure inflation expectations using Twitter?

By:	Cristina Angelico (Bank of Italy); Juri Marcucci (Bank of Italy); Marcello Miccoli (International Monetary Fund); Filippo Quarta (Bank of Italy)
Abstract:	Using Italian data from Twitter, we employ textual data and machine learning techniques to build new real-time measures of consumers' inflation expectations. First, we select some relevant keywords to identify tweets related to prices and expectations thereof. Second, we build a set of daily measures of inflation expectations on the selected tweets combining the Latent Dirichlet Allocation (LDA) with a dictionary-based approach, using manually labelled bi-grams and tri-grams. Finally, we show that the Twitter-based indicators are highly correlated with both monthly survey-based and daily market-based inflation expectations. Our new indicators provide additional information beyond the market-based expectations, the professional forecasts, and the realized inflation, and anticipate consumers' expectations proving to be a good real-time proxy. Results suggest that Twitter can be a new timely source to elicit beliefs.
Keywords:	inflation expectations, Twitter data, text mining, big data, survey-based measures, market-based measures, forecasting
JEL:	E31 C53 C55 D84 E58
Date:	2021–02
URL:	http://d.repec.org/n?u=RePEc:bdi:wptemi:td_1318_21&r=all
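
As a rough illustration of the dictionary-based step described in the abstract above, the Python sketch below counts tweets per day that match labelled bi-grams and nets "up" matches against "down" matches. The bi-gram lists, the sample tweets, and the function names are hypothetical, the LDA filtering step is omitted, and this is not the authors' implementation.

```python
from collections import defaultdict

# Hypothetical labelled bi-grams (the paper uses manually labelled Italian n-grams).
UP_BIGRAMS = {("prices", "rising"), ("more", "expensive")}
DOWN_BIGRAMS = {("prices", "falling"), ("less", "expensive")}


def bigrams(text):
    """Return the list of consecutive word pairs in a lower-cased text."""
    words = text.lower().split()
    return list(zip(words, words[1:]))


def daily_index(tweets):
    """Map each day to (#tweets matching 'up') minus (#tweets matching 'down').

    `tweets` is an iterable of (day, text) pairs; days without any match are omitted.
    """
    index = defaultdict(int)
    for day, text in tweets:
        grams = set(bigrams(text))
        if grams & UP_BIGRAMS:
            index[day] += 1
        if grams & DOWN_BIGRAMS:
            index[day] -= 1
    return dict(index)


# Made-up example tweets, for illustration only.
sample = [
    ("2021-02-01", "Fuel prices rising again"),
    ("2021-02-01", "Everything is more expensive this year"),
    ("2021-02-02", "Good news, prices falling at the market"),
]
index = daily_index(sample)
```

A real pipeline would also need tokenisation that handles punctuation and a way to normalise by daily tweet volume; both are left out here.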

The Gender Pay Gap Revisited with Big Data: Do Methodological Choices Matter?

By:	Strittmatter, Anthony (University of Basel); Wunsch, Conny (University of Basel)
Abstract:	The vast majority of existing studies that estimate the average unexplained gender pay gap use unnecessarily restrictive linear versions of the Blinder-Oaxaca decomposition. Using a notably rich and large data set of 1.7 million employees in Switzerland, we investigate how the methodological improvements made possible by such big data affect estimates of the unexplained gender pay gap. We study the sensitivity of the estimates with regard to i) the availability of observationally comparable men and women, ii) model flexibility when controlling for wage determinants, and iii) the choice of different parametric and semiparametric estimators, including variants that make use of machine learning methods. We find that these three factors matter greatly. Blinder-Oaxaca estimates of the unexplained gender pay gap decline by up to 39% when we enforce comparability between men and women and use a more flexible specification of the wage equation. Semi-parametric matching yields estimates that, when compared with the Blinder-Oaxaca estimates, are up to 50% smaller and also less sensitive to the way wage determinants are included.
Keywords:	Gender Inequality, Gender Pay Gap, Common Support, Model Specification, Matching Estimator, Machine Learning
JEL:	J31 C21
Date:	2021–02–18
URL:	http://d.repec.org/n?u=RePEc:bsl:wpaper:2021/05&r=all
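
For readers unfamiliar with the linear Blinder-Oaxaca decomposition mentioned in the abstract above, a textbook form is sketched below. The notation is an assumption made for illustration and is not taken from the paper: $\bar{Y}_g$ is the mean log wage of group $g \in \{m, f\}$, $\bar{X}_g$ the vector of mean wage determinants (including a constant), and $\hat{\beta}_g$ the OLS coefficients from a wage regression run separately for each group. Using men's coefficients as the reference is one common choice among several.

```latex
\bar{Y}_m - \bar{Y}_f
  = \underbrace{(\bar{X}_m - \bar{X}_f)'\hat{\beta}_m}_{\text{explained}}
  + \underbrace{\bar{X}_f'(\hat{\beta}_m - \hat{\beta}_f)}_{\text{unexplained}}
```

Expanding the right-hand side gives $\bar{X}_m'\hat{\beta}_m - \bar{X}_f'\hat{\beta}_f$, which equals the raw gap because OLS with a constant fits the group means exactly.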

Predicting poverty and malnutrition for targeting, mapping, monitoring, and early warning

By:	McBride, Linden; Barrett, Christopher B.; Browne, Christopher; Hu, Leiqiu; Liu, Yanyan; Matteson, David S.; Sun, Ying; Wen, Jiaming
Abstract:	More frequent and severe shocks combined with more plentiful data and increasingly powerful predictive algorithms heighten the promise of data science in support of humanitarian and development programming. We advocate for embrace of, and investment in, machine learning methods for poverty and malnutrition targeting, mapping, monitoring, and early warning while also cautioning that distinct tasks require different data inputs and methods. In particular, we highlight the differences between poverty and malnutrition targeting and mapping, identification of those in a state of structural versus stochastic deprivation, and the modeling and data challenges of developing early warning systems. Overall, we urge careful consideration of the purpose and possible use cases of big data and machine learning informed models.
Keywords:	Food Security and Poverty, Research and Development/Tech Change/Emerging Technologies
Date:	2021–01
URL:	http://d.repec.org/n?u=RePEc:ags:assa21:309060&r=all

A learning scheme by sparse grids and Picard approximations for semilinear parabolic PDEs

By:	Jean-François Chassagneux; Junchao Chen; Noufel Frikha; Chao Zhou
Abstract:	Relying on the classical connection between Backward Stochastic Differential Equations (BSDEs) and non-linear parabolic partial differential equations (PDEs), we propose a new probabilistic learning scheme for solving high-dimensional semi-linear parabolic PDEs. This scheme is inspired by the approach coming from machine learning and developed using deep neural networks in Han et al. [32]. Our algorithm is based on a Picard iteration scheme in which a sequence of linear-quadratic optimisation problems is solved by means of a stochastic gradient descent (SGD) algorithm. In the framework of a linear specification of the approximation space, we manage to prove a convergence result for our scheme, under some smallness condition. In practice, in order to be able to treat high-dimensional examples, we employ sparse grid approximation spaces. In the case of periodic coefficients and using pre-wavelet basis functions, we obtain an upper bound on the global complexity of our method. It shows in particular that the curse of dimensionality is tamed in the sense that in order to achieve a root mean squared error of order $\epsilon$, for a prescribed precision $\epsilon$, the complexity of the Picard algorithm grows polynomially in $\epsilon^{-1}$ up to some logarithmic factor $|\log(\epsilon)|$ which grows linearly with respect to the PDE dimension. Various numerical results are presented to validate the performance of our method and to compare it with some recent machine learning schemes proposed in Han et al. [20] and Huré et al. [37].
Date:	2021–02
URL:	http://d.repec.org/n?u=RePEc:arx:papers:2102.12051&r=all

Supporting Financial Inclusion with Graph Machine Learning and Super-App Alternative Data

By:	Luisa Roa; Andrés Rodríguez-Rey; Alejandro Correa-Bahnsen; Carlos Valencia
Abstract:	The presence of Super-Apps has changed the way we think about the interactions between users and commerce. It then comes as no surprise that it is also redefining the way banking is done. The paper investigates how different interactions between users within a Super-App provide a new source of information to predict borrower behavior. To this end, two experiments with different graph-based methodologies are proposed: the first uses graph-based features as input in a classification model and the second uses graph neural networks. Our results show that variables of centrality, behavior of neighboring users and transactionality of a user constitute new forms of knowledge that enhance the statistical and financial performance of credit risk models. Furthermore, opportunities are identified for Super-Apps to redefine credit risk by considering the whole environment that their platforms entail, leading to a more inclusive financial system.
Date:	2021–02
URL:	http://d.repec.org/n?u=RePEc:arx:papers:2102.09974&r=all

Modelos de Aprendizaje Automático Mediante Árboles de Decisión

By:	Carlos Arana
Abstract:	Supervised machine learning classification models based on recursive binary partitioning, also called "decision trees", are among the most widely used in data science, not only for their interpretability and performance but also because they are the basis of the most powerful models in use today: decision tree ensembles. Because they are still considered the classification models par excellence, this paper presents their foundations, their constituent elements, and the procedures involved in implementing them, putting them into operation, and measuring their predictive performance.
Keywords:	Artificial Intelligence, Machine Learning, Data Science, Classifiers, Decision Trees
Date:	2021–02
URL:	http://d.repec.org/n?u=RePEc:cem:doctra:778&r=all
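
The recursive binary partitioning described in the abstract above can be illustrated with a single split step. The Python sketch below searches one numeric feature for the threshold that minimises weighted Gini impurity; a tree would apply the same search recursively to each side. Gini impurity is a common criterion, but the abstract does not name one, so treat it, the toy data, and the function names as assumptions.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty or pure list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_split(xs, ys):
    """Find the threshold on one numeric feature that minimises weighted Gini impurity.

    Returns (threshold, impurity); rows with x <= threshold go to the left child.
    If no split improves on the unsplit node, threshold is None.
    """
    n = len(xs)
    best = (None, gini(ys))
    # Skip the largest value: splitting there would leave the right child empty.
    for t in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (t, score)
    return best


# Toy data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0]
ys = ["a", "a", "b", "b"]
threshold, impurity = best_split(xs, ys)
```

On this toy data the search picks the threshold that separates the two classes perfectly.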

Firm-Level Risk Exposures and Stock Returns in the Wake of COVID-19

By:	Steven J. Davis (University of Chicago; National Bureau of Economic Research (NBER); Hoover Institution); Stephen Hansen (Imperial College Business School); Cristhian Seminario-Amez (University of Chicago - Department of Economics)
Abstract:	Firm-level stock returns differ enormously in reaction to COVID-19 news. We characterize these reactions using the Risk Factors discussions in pre-pandemic 10-K filings and two text-analytic approaches: expert-curated dictionaries and supervised machine learning (ML). Bad COVID-19 news lowers returns for firms with high exposures to travel, traditional retail, aircraft production and energy supply – directly and via downstream demand linkages – and raises them for firms with high exposures to healthcare policy, e-commerce, web services, drug trials and materials that feed into supply chains for semiconductors, cloud computing and telecommunications. Monetary and fiscal policy responses to the pandemic strongly impact firm-level returns as well, but differently than pandemic news. Despite methodological differences, dictionary and ML approaches yield remarkably congruent return predictions. Importantly though, ML operates on a vastly larger feature space, yielding richer characterizations of risk exposures and outperforming the dictionary approach in goodness-of-fit. By integrating elements of both approaches, we uncover new risk factors and sharpen our explanations for firm-level returns. To illustrate the broader utility of our methods, we also apply them to explain firm-level returns in reaction to the March 2020 Super Tuesday election results.
Date:	2020
URL:	http://d.repec.org/n?u=RePEc:bfi:wpaper:2020-139&r=all

Computing Consumer Sentiment in Germany via Social Media Data

By:	Karaman Örsal, Deniz Dilan; Sturm, Silke
Abstract:	Survey-based consumer confidence indicators are mostly reported with a delay and are the result of time-consuming and expensive consumer surveys. In this study, we develop an approach to measure current consumer confidence in Germany by computing consumer sentiment from public Tweets from Germany. To achieve this goal we develop a new sentiment score. To measure consumer sentiment, we use text-mining tools and public Tweets from May 2019 to August 2020. Our findings indicate that there is a high correlation between the consumer confidence indicator based on survey data and the consumer sentiment that we compute using data from the Twitter platform. With our approach, we are even able to forecast the change in next month's consumer confidence.
Keywords:	consumer sentiment, consumer confidence, Twitter, sentiment analysis
Date:	2021
URL:	http://d.repec.org/n?u=RePEc:zbw:uhhhdp:7&r=all

Comparing Conventional and Machine-Learning Approaches to Risk Assessment in Domestic Abuse Cases

By:	Jeffrey Grogger (University of Chicago - Harris School of Public Policy; NBER); Sean Gupta (London School of Economics and Political Science - Center for Economic Performance); Ria Ivandic (London School of Economics and Political Science - Center for Economic Performance); Tom Kirchmaier (London School of Economics and Political Science - Center for Economic Performance)
Abstract:	We compare predictions from a conventional protocol-based approach to risk assessment with those based on a machine-learning approach. We first show that the conventional predictions are less accurate than, and have similar rates of negative prediction error as, a simple Bayes classifier that makes use only of the base failure rate. Machine learning algorithms based on the underlying risk assessment questionnaire do better under the assumption that negative prediction errors are more costly than positive prediction errors. Machine learning models based on two-year criminal histories do even better. Indeed, adding the protocol-based features to the criminal histories adds little to the predictive adequacy of the model. We suggest using the predictions based on criminal histories to prioritize incoming calls for service, and devising a more sensitive instrument to distinguish true from false positives that result from this initial screening.
JEL:	K14 K36
Date:	2020
URL:	http://d.repec.org/n?u=RePEc:bfi:wpaper:2021-01&r=all

Open Banking: Credit Market Competition When Borrowers Own the Data

By:	Zhiguo He (University of Chicago - Booth School of Business; NBER); Jing Huang (University of Chicago - Booth School of Business); Jidong Zhou (Yale University - School of Business)
Abstract:	Open banking facilitates data sharing consented by customers who generate the data, with a regulatory goal of promoting competition between traditional banks and challenger fintech entrants. We study lending market competition when sharing banks’ customer data enables better borrower screening or targeting by fintech lenders. Open banking could make the entire financial industry better off yet leave all borrowers worse off, even if borrowers could choose whether to share their data. We highlight the importance of equilibrium credit quality inference from borrowers’ endogenous sign-up decisions. When data sharing triggers privacy concerns by facilitating exploitative targeted loans, the equilibrium sign-up population can grow with the degree of privacy concerns.
Keywords:	Open banking, data sharing, banking competition, digital economy, winner’s curse, privacy, precision marketing
Date:	2020
URL:	http://d.repec.org/n?u=RePEc:bfi:wpaper:2020-168&r=all

Deep Video Prediction for Time Series Forecasting

By:	Zhen Zeng; Tucker Balch; Manuela Veloso
Abstract:	Time series forecasting is essential for decision making in many domains. In this work, we address the challenge of predicting the price evolution of multiple potentially interacting financial assets. A solution to this problem has obvious importance for governments, banks, and investors. Statistical methods such as Auto Regressive Integrated Moving Average (ARIMA) are widely applied to these problems. In this paper, we propose to approach economic time series forecasting of multiple financial assets in a novel way via video prediction. Given past prices of multiple potentially interacting financial assets, we aim to predict their future price evolution. Instead of treating the snapshot of prices at each time point as a vector, we spatially lay out these prices in 2D as an image, such that we can harness the power of CNNs in learning a latent representation for these financial assets. Thus, the history of these prices becomes a sequence of images, and our goal becomes predicting future images. We build on a state-of-the-art video prediction method for forecasting future images. Our experiments involve the prediction task of the price evolution of nine financial assets traded in U.S. stock markets. The proposed method outperforms baselines including ARIMA, Prophet, and variations of the proposed method, demonstrating the benefits of harnessing the power of CNNs in the problem of economic time series forecasting.
Date:	2021–02
URL:	http://d.repec.org/n?u=RePEc:arx:papers:2102.12061&r=all
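
To make the "prices as images" idea in the abstract above concrete, the Python sketch below reshapes each time step's price vector into a small 2D grid, so that a price history becomes a sequence of frames. The row-major placement of assets, the 3x3 grid, and the synthetic data are assumptions for illustration; the abstract does not say how the authors arrange assets spatially, and no prediction model is included.

```python
def prices_to_frames(prices, height, width):
    """Turn a price history into a list of 2D frames, one per time step.

    `prices` is a list of rows, each holding height * width asset prices.
    Asset i is placed at pixel (i // width, i % width), i.e. row-major order.
    """
    frames = []
    for row in prices:
        if len(row) != height * width:
            raise ValueError("each time step must hold height * width prices")
        frame = [list(row[r * width:(r + 1) * width]) for r in range(height)]
        frames.append(frame)
    return frames


# Synthetic data: 5 time steps of prices for 9 hypothetical assets.
history = [[float(t * 9 + i) for i in range(9)] for t in range(5)]
frames = prices_to_frames(history, 3, 3)
```

Each element of `frames` could then be fed to an image or video model as a single-channel frame.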

Politician-Citizen Interactions and Dynamic Representation: Evidence from Twitter

By:	Aina Gallego; Gaël Le Mens; Nikolas Schöll
Abstract:	We study how politicians learn about public opinion through their regular interactions with citizens and how they respond to perceived changes. We model this process within a reinforcement learning framework: politicians talk about different policy issues, listen to feedback, and increase attention to better received issues. Because politicians are exposed to different feedback depending on their social identities, being responsive leads to divergence in issue attention over time. We apply these ideas to study the rise of gender issues. We collected 1.5 million tweets written by Spanish MPs, classified them using a deep learning algorithm, and measured feedback using retweets and likes. We find that politicians are responsive to feedback and that female politicians receive relatively more positive feedback for writing on gender issues. An analysis of mechanisms sheds light on why this happens. In the conclusion, we discuss how reinforcement learning can create unequal responsiveness, misperceptions, and polarization.
Keywords:	political responsiveness, representation, social media, gender
Date:	2021–02
URL:	http://d.repec.org/n?u=RePEc:bge:wpaper:1238&r=all
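
The feedback loop described in the abstract above (talk about issues, observe feedback, shift attention to better-received issues) can be sketched with a very simple reinforcement-learning update. The update rule, the learning rate, the softmax mapping from estimated reception to attention shares, and the feedback numbers are all assumptions made for illustration; they are not the authors' model.

```python
import math


def softmax(values):
    """Convert real-valued scores into shares that are positive and sum to one."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def update_values(values, feedback, learning_rate=0.5):
    """Move each issue's estimated reception toward the feedback just observed."""
    return [v + learning_rate * (f - v) for v, f in zip(values, feedback)]


issues = ["economy", "gender"]
values = [0.0, 0.0]  # initial estimates of how well each issue is received
for _ in range(3):
    # Hypothetical feedback: the second issue is consistently better received.
    values = update_values(values, feedback=[0.2, 1.0])
attention = softmax(values)  # share of attention devoted to each issue
```

If two politicians face different feedback vectors, repeated application of the same rule pulls their attention shares apart, which is the divergence mechanism the abstract points to.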

Politician-citizen interactions and dynamic representation: Evidence from Twitter

By:	Aina Gallego; Nikolas Schöll; Gaël Le Mens
Abstract:	We study how politicians learn about public opinion through their regular interactions with citizens and how they respond to perceived changes. We model this process within a reinforcement learning framework: politicians talk about different policy issues, listen to feedback, and increase attention to better received issues. Because politicians are exposed to different feedback depending on their social identities, being responsive leads to divergence in issue attention over time. We apply these ideas to study the rise of gender issues. We collected 1.5 million tweets written by Spanish MPs, classified them using a deep learning algorithm, and measured feedback using retweets and likes. We find that politicians are responsive to feedback and that female politicians receive relatively more positive feedback for writing on gender issues. An analysis of mechanisms sheds light on why this happens. In the conclusion, we discuss how reinforcement learning can create unequal responsiveness, misperceptions, and polarization.
Keywords:	political responsiveness, representation, social media, gender
Date:	2021–01
URL:	http://d.repec.org/n?u=RePEc:upf:upfgen:1769&r=all

CEO behavior and firm performance

By:	Bandiera, Oriana; Prat, Andrea; Hansen, Stephen; Sadun, Raffaella
Abstract:	We develop a new method to measure CEO behavior in large samples via a survey that collects high-frequency, high-dimensional diary data and a machine learning algorithm that estimates behavioral types. Applying this method to 1,114 CEOs in six countries reveals two types: “leaders,” who do multifunction, high-level meetings, and “managers,” who do individual meetings with core functions. Firms that hire leaders perform better, and it takes three years for a new CEO to make a difference. Structural estimates indicate that productivity differentials are due to mismatches rather than to leaders being better for all firms.
JEL:	J50
Date:	2020–04–01
URL:	http://d.repec.org/n?u=RePEc:ehl:lserod:101423&r=all

"For the times they are a-changin": Gauging Uncertainty Perception over Time

By:	Müller, Henrik; Hornig, Nico; Rieger, Jonas
Abstract:	This paper deals with the problem of deriving consistent time series from newspaper content-based topic models. In the first part, we recapitulate a few of our own failed attempts; in the second, we show some results using a twin strategy that we call prototyping and seeding. Given the popularity news-based indicators have assumed in econometric analyses in recent years, this seems to be a valuable exercise for researchers working on related issues. Building on earlier writings, where we use the topic modelling approach Latent Dirichlet Allocation (LDA) to gauge economic uncertainty perception, we show the difficulties that arise when a number of one-shot LDAs, performed at different points in time, are used to produce something akin to a time series. The models' topic structures differ considerably from computation to computation. Neither parameter variations nor the aggregation of several topics into broader categories of related content is able to solve the problem of incompatibility. The cause is not just the content that is added at each observation point, but the very properties of LDA itself: since it uses random initializations and conditional reassignments within the iterative process, fundamentally different models can emerge when the algorithm is executed several times, even if the data and the parameter settings are identical. To tame LDA's randomness, we apply a newish "prototyping" approach to the corpus, upon which our Uncertainty Perception Indicator (UPI) is built. Still, the outcomes vary considerably over time. To get closer to our goal, we drop the notion that LDA models should be allowed to take various forms freely at each run. Instead, the topic structure is fixed, using a "seeding" technique that distributes incoming new data to our model's existing topic structure. This approach seems to work quite well, as our consistent and plausible results show, but it too is bound to run into difficulties over time.
Keywords:	uncertainty, economic policy, business cycles, Covid-19, Latent Dirichlet Allocation, Seeded LDA
Date:	2021
URL:	http://d.repec.org/n?u=RePEc:zbw:docmaw:3&r=all

Artificial Intelligence, Teacher Tasks and Individualized Pedagogy

By:	Ferman, Bruno; Lima, Lycia; Riva, Flávio
Abstract:	This paper investigates how educational technologies that use different combinations of artificial and human intelligence are incorporated into classroom instruction, and how they ultimately affect learning. We conducted a field experiment to study two technologies that allow teachers to outsource grading and feedback tasks on writing practices of high school seniors. The first technology is a fully automated evaluation system that provides instantaneous scores and feedback. The second one uses human graders as an additional resource to enhance grading and feedback quality in aspects in which the automated system arguably falls short. Both technologies significantly improved students' essay scores in a large college admission exam, and the addition of human graders did not improve effectiveness in spite of increasing perceived feedback quality. Both technologies also similarly helped teachers engage more frequently on personal discussions on essay quality with their students. Taken together, these results indicate that teachers' task composition shifted toward nonroutine activities and this helped circumvent some of the limitations of artificial intelligence. More generally, our results illustrate how the most recent wave of technological change may relocate labor to analytical and interactive tasks that still remain a challenge to automation.
Date:	2021–02–17
URL:	http://d.repec.org/n?u=RePEc:osf:socarx:qw249&r=all

Governance of Data Sharing : a Law & Economics Proposal

By:	Graef, Inge (Tilburg University, TILEC); Prüfer, Jens (Tilburg University, TILEC)
Keywords:	Data sharing; data-driven markets; economic governance; competition law; data protection; regulation
Date:	2021
URL:	http://d.repec.org/n?u=RePEc:tiu:tiutil:b64b51f8-16af-45c8-87ea-371a9551ec7d&r=all

Données de santé : l'arbre StopCovid qui cache la forêt Health Data Hub

By:	Bernard Fallery (MRM - Montpellier Research in Management - UM - Université de Montpellier - Groupe Sup de Co Montpellier (GSCM) - Montpellier Business School - UM1 - Université Montpellier 1 - UPVD - Université de Perpignan Via Domitia - UM2 - Université Montpellier 2 - Sciences et Techniques - UPVM - Université Paul-Valéry - Montpellier 3)
Abstract:	The StopCovid project for "acceptable" social tracing using smartphones has captured everyone's attention. Meanwhile, however, a much broader project is being pushed ahead at a forced pace: the health data platform (known in Franglais as the Health Data Hub, HDH). As soon as the Villani report was delivered in March 2018, the President of the Republic announced the HDH project. By October, a preliminary design mission had outlined a centralized national system covering all public health data, a one-stop shop from which artificial intelligence (AI) could optimize pattern recognition and personalized prediction services. The AI ecosystem is thus preparing to take a new step by gaining access to massive data from hospitals, research, community medicine, connected devices, and so on, and to a massive health market (prestigious and of enormous potential value, since it accounts for more than 12% of GDP). Looking past the tree that hides the forest means discovering the full extent of the questions raised by the "digital transformation" in society, and here in health.
Keywords:	Health data, Artificial intelligence, Microsoft, Health data hub
Date:	2020–05
URL:	http://d.repec.org/n?u=RePEc:hal:journl:hal-03125892&r=all

This nep-big issue is ©2021 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.

General information on the NEP project can be found at http://nep.repec.org. For comments please write to the director of NEP, Marco Novarese at <director@nep.repec.org>. Put “NEP” in the subject, otherwise your mail may be rejected.

NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.