nep-big New Economics Papers
on Big Data
Issue of 2021‒07‒12
28 papers chosen by
Tom Coupé
University of Canterbury

  1. Sentiment and Uncertainty about Regulation By Tara M. Sinclair; Zhoudan Xie
  2. Machine learning in the prediction of flat horse racing results in Poland By Piotr Borowski; Marcin Chlebus
  3. Forecasting Canadian GDP Growth with Machine Learning By Shafiullah Qureshi; Ba Chu; Fanny S. Demers
  4. Words Matter: Gender, Jobs and Applicant Behavior By Sugat Chaturvedi; Kanika Mahajan; Zahra Siddique
  5. Words Matter: Gender, Jobs and Applicant Behavior By Sugat Chaturvedi; Kanika Mahajan; Zahra Siddique
  6. The strength of weak and strong ties in bridging geographic and cognitive distances By Abbasiharofteh, Milad; Kinne, Jan; Krüger, Miriam
  7. Words Matter: Gender, Jobs and Applicant Behavior By Sugat Chaturvedi; Kanika Mahajan; Zahra Siddique
  8. Kundenbindung in der Finanzindustrie - Ein empirischer Ansatz By Zettler, Julia
  9. Catching The Drivers of Inclusive Growth in Sub-Saharan Africa: An Application of Machine Learning By Ofori, Isaac Kwesi
  10. Catching the Drivers of Inclusive Growth in Sub-Saharan Africa: An Application of Machine Learning By Isaac K. Ofori
  11. Comparison of the accuracy in VaR forecasting for commodities using different methods of combining forecasts By Szymon Lis; Marcin Chlebus
  12. Catching the Drivers of Inclusive Growth in Sub-Saharan Africa: An Application of Machine Learning By Isaac K. Ofori
  13. Local inequalities of the COVID-19 crisis By Cerqua, Augusto; Letta, Marco
  14. Price discrimination with inequity-averse consumers: A reinforcement learning approach By Buchali, Katrin
  15. A liquidity risk early warning indicator for Italian banks: a machine learning approach By Maria Ludovica Drudi; Stefano Nobili
  16. Shapes as Product Differentiation: Neural Network Embedding in the Analysis of Markets for Fonts By Sukjin Han; Eric Schulman; Kristen Grauman; Santhosh Ramakrishnan
  17. Epidemic Exposure, Fintech Adoption, and the Digital Divide By Saka, Orkun; Eichengreen, Barry; Aksoy, Cevat Giray
  18. Active Labour Market Policies for the Long-Term Unemployed: New Evidence from Causal Machine Learning By Goller, Daniel; Harrer, Tamara; Lechner, Michael; Wolff, Joachim
  19. Scalable Econometrics on Big Data -- The Logistic Regression on Spark By Aurélien Ouattara; Matthieu Bulté; Wan-Ju Lin; Philipp Scholl; Benedikt Veit; Christos Ziakas; Florian Felice; Julien Virlogeux; George Dikos
  20. The data archive as factory: alienation and resistance of data processors By Plantin, Jean-Christophe
  21. Analysis of feature influence on Covid-19 Death Rate Per Country Using a Novel Orthogonalization Technique By Gonnet, Gaston H.; Stewart, John; Lafleur, Joseph; Keith, Stephen; McLellan, Mark; Jiang-Gorsline, David; Snider, Tim
  22. Don’t Worry, Be Happy – But Only Seasonally By Mateusz Kijewski; Szymon Lis; Michał Woźniak; Maciej Wysocki
  23. Information theoretic causality detection between financial and sentiment data By Scaramozzino, Roberta; Cerchiello, Paola; Aste, Tomaso
  24. Quantifying the Impact of Human Capital, Job History, and Language Factors on Job Seniority with a Large-scale Analysis of Resumes By Austin P Wright; Caleb Ziems; Haekyu Park; Jon Saad-Falcon; Duen Horng Chau; Diyi Yang; Maria Tomprou
  25. The impact of COVID-19 on analysts’ sentiment about the banking sector By Alicia Aguilar; Diego Torres
  26. Pandemic perception and regulation effectiveness: Evidence from the COVID-19 By Luisa Loiacono; Riccardo Puglisi; Leonzio Rizzo; Riccardo Secomandi
  27. Anxiety, Expectations Stabilization and Intertemporal Markets: Theory, Evidence and Policy By Francesco Carbonero; Jeremy Davies; Ekkehard Ernst; Sayantan Ghosal; Leaza McSorley
  28. Measuring and Evaluating Strategic Communications at the Bank of Canada By Annie Portelance

  1. By: Tara M. Sinclair (The George Washington University); Zhoudan Xie (The George Washington University)
    Abstract: Regulatory policy can create economic and social benefits, but poorly designed or excessive regulation may generate substantial adverse effects on the economy. In this paper, we present measures of sentiment and uncertainty about regulation in the U.S. over time and examine their relationships with macroeconomic performance. We construct the measures using lexicon-based sentiment analysis of an original news corpus, which covers 493,418 news articles related to regulation from seven leading U.S. newspapers. As a result, we build monthly indexes of sentiment and uncertainty about regulation and categorical indexes for 14 regulatory policy areas from January 1985 to August 2020. Impulse response functions indicate that a negative shock to sentiment about regulation is associated with large, persistent drops in future output and employment, while increased regulatory uncertainty overall reduces output and employment temporarily. These results suggest that sentiment about regulation plays a more important economic role than uncertainty about regulation. Furthermore, economic outcomes are particularly sensitive to sentiment around transportation regulation and to uncertainty around labor regulation.
    Keywords: Regulation, text analysis, NLP, sentiment analysis, uncertainty
    JEL: E2 E3 K2 O4
    Date: 2021–06
    URL: http://d.repec.org/n?u=RePEc:gwc:wpaper:2021-004&r=
  2. By: Piotr Borowski (Faculty of Economic Sciences, University of Warsaw); Marcin Chlebus (Faculty of Economic Sciences, University of Warsaw)
    Abstract: Horse racing has been studied by many researchers who examined market efficiency and applied complex mathematical formulas to predict race results. We were the first to compare selected machine learning methods for creating a profitable betting strategy for two common bets, Win and Quinella. Six classification algorithms were used under different betting scenarios, namely Classification and Regression Tree (CART), Generalized Linear Model (Glmnet), Extreme Gradient Boosting (XGBoost), Random Forest (RF), Neural Network (NN) and Linear Discriminant Analysis (LDA). Additionally, Variable Importance was applied to determine the leading horse racing factors. The data were collected from flat racetracks in Poland from 2011 to 2020 and describe 3,782 Arabian and Thoroughbred races in total. We managed to profit under specific circumstances and achieved a correct-bet ratio of 41% for the Win bet and over 36% for the Quinella bet using LDA and Neural Networks. The results demonstrate that it was possible to bet effectively using the chosen methods and indicate a possible market inefficiency.
    Keywords: horse racing prediction, racetrack betting, Thoroughbred and Arabian flat racing, machine learning, Variable Importance
    JEL: C53 C55 C45
    Date: 2021
    URL: http://d.repec.org/n?u=RePEc:war:wpaper:2021-13&r=
  3. By: Shafiullah Qureshi (Department of Economics, Carleton University); Ba Chu (Department of Economics, Carleton University); Fanny S. Demers (Department of Economics, Carleton University)
    Abstract: This paper applies state-of-the-art machine learning (ML) algorithms to forecast monthly real GDP growth in Canada by using both Google Trends (GT) data and official macroeconomic data (which are available ahead of the release of GDP data by Statistics Canada). We show that we can forecast real GDP growth accurately ahead of the release of GDP figures by using GT and official data (such as employment) as predictors. We first pre-select features by applying up-to-date techniques, namely, XGBoost’s variable importance score, and a recent variable-screening procedure for time series data, namely, PDC-SIS+. These pre-selected features are then used to build advanced ML models for forecasting real GDP growth, by employing tree-based ensemble algorithms, such as XGBoost, LightGBM, Random Forest, and GBM. We provide empirical evidence that the variables pre-selected by either PDC-SIS+ or the XGBoost’s variable importance score can have a superior forecasting ability. We find that the pre-selected GT data features perform as well as the pre-selected official data features with respect to short-term forecasting ability, while the pre-selected official data features are superior with respect to long-term forecasting ability. We also find that (1) the ML algorithms we employ often perform better with a large sample than with a small sample, even when the small sample has a larger set of predictors; and (2) the Random Forest (that often produces nonlinear models to capture nonlinear patterns in the data) tends to under-perform a standard autoregressive model in several cases while there is no clear evidence that the XGBoost and the LightGBM can always outperform each other.
    Date: 2021–05–17
    URL: http://d.repec.org/n?u=RePEc:car:carecp:21-05&r=
  4. By: Sugat Chaturvedi (Indian Statistical Institute, Delhi); Kanika Mahajan (Ashoka University); Zahra Siddique (University of Bristol)
    Abstract: We examine employer preferences for hiring men vs women using 160,000 job ads posted on an online job portal in India, linked with more than 6 million applications. We apply machine learning algorithms on text contained in job ads to predict an employer's gender preference. We find that advertised wages are lowest in jobs where employers prefer women, even when this preference is implicitly retrieved through the text analysis, and that these jobs also attract a larger share of female applicants. We then systematically uncover what lies beneath these relationships by retrieving words that are predictive of an explicit gender preference, or gendered words, and assigning them to the categories of hard and soft-skills, personality traits, and flexibility. We find that skills-related female-gendered words have low returns but attract a higher share of female applicants while male-gendered words indicating decreased flexibility (e.g., frequent travel or unusual working hours) have high returns but result in a smaller share of female applicants. This contributes to a gender earnings gap. Our findings illustrate how gender preferences are partly driven by stereotypes and statistical discrimination.
    Keywords: Gender, Job portal, Machine learning
    Date: 2021–06
    URL: http://d.repec.org/n?u=RePEc:ash:wpaper:63&r=
  5. By: Sugat Chaturvedi (Indian Statistical Institute, Delhi); Kanika Mahajan (Ashoka University); Zahra Siddique (University of Bristol)
    Abstract: We examine employer preferences for hiring men vs women using 160,000 job ads posted on an online job portal in India, linked with more than 6 million applications. We apply machine learning algorithms on text contained in job ads to predict an employer’s gender preference. We find that advertised wages are lowest in jobs where employers prefer women, even when this preference is implicitly retrieved through the text analysis, and that these jobs also attract a larger share of female applicants. We then systematically uncover what lies beneath these relationships by retrieving words that are predictive of an explicit gender preference, or gendered words, and assigning them to the categories of hard and soft-skills, personality traits, and flexibility. We find that skills-related female-gendered words have low returns but attract a higher share of female applicants while male-gendered words indicating decreased flexibility (e.g., frequent travel or unusual working hours) have high returns but result in a smaller share of female applicants. This contributes to a gender earnings gap. Our findings illustrate how gender preferences are partly driven by stereotypes and statistical discrimination.
    Keywords: gender, job portal, machine learning
    JEL: J16 J63 J71
    Date: 2021–06
    URL: http://d.repec.org/n?u=RePEc:alo:isipdp:21-03&r=
  6. By: Abbasiharofteh, Milad; Kinne, Jan; Krüger, Miriam
    Abstract: The proximity framework has attracted considerable attention in a scholarly discourse on the driving forces of knowledge exchange tie formation. It has been discussed that too much proximity is negatively associated with the effectiveness of a knowledge exchange relation. However, little is known about the key factors that trigger the formation of boundary-spanning knowledge ties. Going beyond the "dyadic" perspective on proximity dimensions, this paper argues that the key factor in bridging distances may reside at the "triadic" level. We build on the notion of "the strength of weak ties" and its recent development by investigating the innovative performance and relations of more than 600,000 German firms. We explored and extracted information from the textual and relational content of firms' websites by using machine learning techniques and hyperlink analysis. We thereby proxied the innovative performance of firms using a deep learning text analysis approach and showed that the triadic property of bridging dyadic relations is a reliable predictor of firms' innovativeness. Relations embedded in cliques (i.e., strong ties) that connect cognitively distant firms are more strongly associated with firms' innovation, whereas inter-regional relations connecting different parts of a network (i.e., weak ties) are positively associated with firms' innovative performance. Also, the results suggest that a combination of strong inter-community and weak inter-regional relations is more positively related with firms' innovativeness compared to the combination of other relation types.
    Keywords: weak and strong ties, proximity, knowledge exchange, innovation, web mining, natural language processing
    JEL: C81 D83 L14 O31
    Date: 2021
    URL: http://d.repec.org/n?u=RePEc:zbw:zewdip:21049&r=
  7. By: Sugat Chaturvedi; Kanika Mahajan; Zahra Siddique
    Abstract: We examine employer preferences for hiring men vs women using 160,000 job ads posted on an online job portal in India, linked with more than 6 million applications. We apply machine learning algorithms on text contained in job ads to predict an employer’s gender preference. We find that advertised wages are lowest in jobs where employers prefer women, even when this preference is implicitly retrieved through the text analysis, and that these jobs also attract a larger share of female applicants. We then systematically uncover what lies beneath these relationships by retrieving words that are predictive of an explicit gender preference, or gendered words, and assigning them to the categories of hard and soft-skills, personality traits, and flexibility. We find that skills-related female-gendered words have low returns but attract a higher share of female applicants while male-gendered words indicating decreased flexibility (e.g., frequent travel or unusual working hours) have high returns but result in a smaller share of female applicants. This contributes to a gender earnings gap. Our findings illustrate how gender preferences are partly driven by stereotypes and statistical discrimination.
    Date: 2021–06–22
    URL: http://d.repec.org/n?u=RePEc:bri:uobdis:21/747&r=
  8. By: Zettler, Julia
    Abstract: The low interest rate phase, advancing regulation, and digitalization are increasing margin pressure as the retail banking market becomes more competitively oriented. Especially for established financial institutions with high market shares, such as the Sparkassen (savings banks), information about possible factors influencing customer loyalty is critical to success. This study examines selected endogenous customer loyalty mechanisms. For example, it analyzes the influence of a private customer's product portfolio on that customer's loyalty. The analysis follows a well-founded, data-based approach: results from state-of-the-art machine learning approaches are checked for plausibility against the results of extensive descriptive analyses. Among other things, the study shows that the number of products dominates the type of products as an indicator of sustained customer loyalty.
    Date: 2021
    URL: http://d.repec.org/n?u=RePEc:dar:wpaper:127410&r=
  9. By: Ofori, Isaac Kwesi
    Abstract: A conspicuous lacuna in the literature on Sub-Saharan Africa (SSA) is the lack of clarity on variables key for driving and predicting inclusive growth. To address this, I train the machine learning algorithms for the Standard lasso, the Minimum Schwarz Bayesian Information Criterion (Minimum BIC) lasso, and the Adaptive lasso to study patterns in a dataset comprising 97 covariates of inclusive growth for 43 SSA countries. First, the regularization results show that only 13 variables are key for driving inclusive growth in SSA. Further, the results show that out of the 13, the poverty headcount (US$1.90) matters most. Second, the findings reveal that ‘Minimum BIC lasso’ is best for predicting inclusive growth in SSA. Policy recommendations are provided in line with the region’s green agenda and the coming into force of the African Continental Free Trade Area.
    Keywords: Clean Fuel, Economic Growth, Machine Learning, Lasso, Sub-Saharan Africa, Regularization, Poverty
    JEL: C01 C14 C51 C52 C55 F43 O4 O55
    Date: 2021
    URL: http://d.repec.org/n?u=RePEc:zbw:esprep:235482&r=
  10. By: Isaac K. Ofori (University of Insubria, Varese, Italy)
    Abstract: A conspicuous lacuna in the literature on Sub-Saharan Africa (SSA) is the lack of clarity on variables key for driving and predicting inclusive growth. To address this, I train the machine learning algorithms for the Standard lasso, the Minimum Schwarz Bayesian Information Criterion (Minimum BIC) lasso, and the Adaptive lasso to study patterns in a dataset comprising 97 covariates of inclusive growth for 43 SSA countries. First, the regularization results show that only 13 variables are key for driving inclusive growth in SSA. Further, the results show that out of the 13, the poverty headcount (US$1.90) matters most. Second, the findings reveal that ‘Minimum BIC lasso’ is best for predicting inclusive growth in SSA. Policy recommendations are provided in line with the region’s green agenda and the coming into force of the African Continental Free Trade Area.
    Keywords: Clean Fuel, Economic Growth, Machine Learning, Lasso, Sub-Saharan Africa, Regularization, Poverty
    JEL: C01 C14 C51 C52 C55 F43 O4 O55
    Date: 2021–01
    URL: http://d.repec.org/n?u=RePEc:agd:wpaper:21/044&r=
  11. By: Szymon Lis (Faculty of Economic Sciences, University of Warsaw); Marcin Chlebus (Faculty of Economic Sciences, University of Warsaw)
    Abstract: No model dominates existing VaR forecasting comparisons. This problem may be solved by combining forecasts. This study investigates daily volatility forecasting for commodities (gold, silver, oil, gas, copper) from 2000 to 2020 and identifies the source of performance improvements between individual GARCH models and forecast combination methods (mean, the lowest, the highest, CQOM, quantile regression with the elastic net or LASSO regularization, random forests, gradient boosting, neural network) through the MCS. Results indicate that individual models achieve more accurate VaR forecasts for the confidence level of 0.975, but combined forecasts are more precise for 0.99. In most cases simple combining methods (mean or the lowest VaR) are the best. Such evidence demonstrates that combining forecasts is important to get better results from the existing models. The study shows that combining the forecasts allows for more accurate VaR forecasting, although it is difficult to find accurate, complex methods.
    Keywords: Combining forecasts, Econometric models, Finance, Financial markets, GARCH models, Neural networks, Regression, Time series, Risk, Value-at-Risk, Machine learning, Model Confidence Set
    JEL: C51 C52 C53 G32 Q01
    Date: 2021
    URL: http://d.repec.org/n?u=RePEc:war:wpaper:2021-11&r=
  12. By: Isaac K. Ofori (University of Insubria, Varese, Italy)
    Abstract: A conspicuous lacuna in the literature on Sub-Saharan Africa (SSA) is the lack of clarity on variables key for driving and predicting inclusive growth. To address this, I train the machine learning algorithms for the Standard lasso, the Minimum Schwarz Bayesian Information Criterion (Minimum BIC) lasso, and the Adaptive lasso to study patterns in a dataset comprising 97 covariates of inclusive growth for 43 SSA countries. First, the regularization results show that only 13 variables are key for driving inclusive growth in SSA. Further, the results show that out of the 13, the poverty headcount (US$1.90) matters most. Second, the findings reveal that ‘Minimum BIC lasso’ is best for predicting inclusive growth in SSA. Policy recommendations are provided in line with the region’s green agenda and the coming into force of the African Continental Free Trade Area.
    Keywords: Economic Integration, Financial Deepening, GMM, MENA, Globalisation, Inequality, Poverty
    JEL: F14 F15 F6 I3 O53
    Date: 2021–01
    URL: http://d.repec.org/n?u=RePEc:exs:wpaper:21/044&r=
  13. By: Cerqua, Augusto; Letta, Marco
    Abstract: This paper assesses the impact of the first wave of the pandemic on the local economies of one of the hardest-hit countries, Italy. We combine quarterly local labor market data with the new machine learning control method for counterfactual building. Our results document that the economic effects of the COVID-19 shock are dramatically unbalanced across the Italian territory and spatially uncorrelated with the epidemiological pattern of the first wave. The heterogeneity of employment losses is associated with exposure to social aggregation risks and pre-existing labor market fragilities. Finally, we quantify the protective role played by the labor market interventions implemented by the government and show that, while effective, they disproportionately benefitted the most developed Italian regions. Such diverging trajectories and unequal policy effects call for a place-based policy approach that promptly addresses the uneven economic geography of the current crisis.
    Keywords: impact evaluation, counterfactual approach, machine learning, local labor markets, COVID-19, Italy
    JEL: C53 D22 E24 R12
    Date: 2021
    URL: http://d.repec.org/n?u=RePEc:zbw:glodps:875&r=
  14. By: Buchali, Katrin
    Abstract: With the advent of big data, unique opportunities arise for data collection and analysis and thus for personalized pricing. We simulate a self-learning algorithm setting personalized prices based on additional information about consumer sensitivities in order to analyze market outcomes for consumers who have a preference for fair, equitable outcomes. For this purpose, we compare a situation that does not consider fairness to a situation in which we allow for inequity-averse consumers. We show that the algorithm learns to charge different, revenue-maximizing prices and simultaneously increase fairness in terms of a more homogeneous distribution of prices.
    Keywords: pricing algorithm, reinforcement learning, Q-learning, price discrimination, fairness, inequity
    JEL: D63 D91 L12
    Date: 2021
    URL: http://d.repec.org/n?u=RePEc:zbw:hohdps:022021&r=
  15. By: Maria Ludovica Drudi (Bank of Italy); Stefano Nobili (Bank of Italy)
    Abstract: The paper develops an early warning system to identify banks that could face liquidity crises. To obtain a robust system for measuring banks’ liquidity vulnerabilities, we compare the predictive performance of three models – logistic LASSO, random forest and Extreme Gradient Boosting – and of their combination. Using a comprehensive dataset of liquidity crisis events between December 2014 and January 2020, our early warning models’ signals are calibrated according to the policymaker's preferences between type I and II errors. Unlike most of the literature, which focuses on default risk and typically proposes a forecast horizon ranging from 4 to 6 quarters, we analyse liquidity risk and we consider a 3-month forecast horizon. The key finding is that combining different estimation procedures improves model performance and yields accurate out-of-sample predictions. The results show that the combined models achieve an extremely low percentage of false negatives, lower than the values usually reported in the literature, while at the same time limiting the number of false positives.
    Keywords: banking crisis, early warning models, liquidity risk, lender of last resort, machine learning
    JEL: C52 C53 G21 E58
    Date: 2021–06
    URL: http://d.repec.org/n?u=RePEc:bdi:wptemi:td_1337_21&r=
  16. By: Sukjin Han; Eric Schulman; Kristen Grauman; Santhosh Ramakrishnan
    Abstract: Many differentiated products have key attributes that are unstructured and thus high-dimensional (e.g., design, text). Instead of treating unstructured attributes as unobservables in economic models, quantifying them can be important to answer interesting economic questions. To propose an analytical framework for this type of product, this paper considers one of the simplest design products, fonts, and investigates merger and product differentiation using an original dataset from the world's largest online marketplace for fonts. We quantify font shapes by constructing embeddings from a deep convolutional neural network. Each embedding maps a font's shape onto a low-dimensional vector. In the resulting product space, designers are assumed to engage in Hotelling-type spatial competition. From the image embeddings, we construct two alternative measures that capture the degree of design differentiation. We then study the causal effects of a merger on the merging firm's creative decisions using the constructed measures in a synthetic control method. We find that the merger causes the merging firm to increase the visual variety of font design. Notably, such effects are not captured when using traditional measures for product offerings (e.g., specifications and the number of products) constructed from structured data.
    Date: 2021–07–08
    URL: http://d.repec.org/n?u=RePEc:bri:uobdis:21/750&r=
  17. By: Saka, Orkun; Eichengreen, Barry; Aksoy, Cevat Giray
    Abstract: We ask whether epidemic exposure leads to a shift in financial technology usage within and across countries and if so who participates in this shift. We exploit a dataset combining Gallup World Polls and Global Findex surveys for some 250,000 individuals in 140 countries, merging them with information on the incidence of epidemics and local 3G internet infrastructure. Epidemic exposure is associated with an increase in remote-access (online/mobile) banking and substitution from bank branch-based to ATM-based activity. Using a machine-learning algorithm, we show that heterogeneity in this response centers on the age, income and employment of respondents. Young, high-income earners in full-time employment have the greatest propensity to shift to online/mobile transactions in response to epidemics. These effects are larger for individuals in subnational regions with better ex ante 3G signal coverage, highlighting the role of the digital divide in adaption to new technologies necessitated by adverse external shocks.
    Date: 2021–07–02
    URL: http://d.repec.org/n?u=RePEc:osf:socarx:b6nv3&r=
  18. By: Goller, Daniel (University of St. Gallen); Harrer, Tamara (Institute for Employment Research (IAB), Nuremberg); Lechner, Michael (University of St. Gallen); Wolff, Joachim (Institute for Employment Research (IAB), Nuremberg)
    Abstract: We investigate the effectiveness of three different job-search and training programmes for German long-term unemployed persons. On the basis of an extensive administrative data set, we evaluated the effects of those programmes on various levels of aggregation using Causal Machine Learning. We found participants to benefit from the investigated programmes with placement services to be most effective. Effects are realised quickly and are long-lasting for any programme. While the effects are rather homogenous for men, we found differential effects for women in various characteristics. Women benefit in particular when local labour market conditions improve. Regarding the allocation mechanism of the unemployed to the different programmes, we found the observed allocation to be as effective as a random allocation. Therefore, we propose data-driven rules for the allocation of the unemployed to the respective labour market programmes that would improve the status-quo.
    Keywords: policy evaluation, Modified Causal Forest (MCF), active labour market programmes, conditional average treatment effect (CATE)
    JEL: J08 J68
    Date: 2021–06
    URL: http://d.repec.org/n?u=RePEc:iza:izadps:dp14486&r=
  19. By: Aurélien Ouattara; Matthieu Bulté; Wan-Ju Lin; Philipp Scholl; Benedikt Veit; Christos Ziakas; Florian Felice; Julien Virlogeux; George Dikos
    Abstract: Extra-large datasets are becoming increasingly accessible, and computing tools designed to handle huge amounts of data efficiently are democratizing rapidly. However, conventional statistical and econometric tools still lack fluency when dealing with such large datasets. This paper dives into econometrics on big datasets, specifically focusing on logistic regression on Spark. We review the robustness of the functions available in Spark to fit logistic regression and introduce a package that we developed in PySpark which returns the statistical summary of the logistic regression, necessary for statistical inference.
    Date: 2021–06
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2106.10341&r=
  20. By: Plantin, Jean-Christophe
    Abstract: Archival data processing consists of cleaning and formatting data between the moment a dataset is deposited and its publication on the archive’s website. In this article, I approach data processing by combining scholarship on invisible labor in knowledge infrastructures with a Marxian framework and show the relevance of considering data processing as factory labor. Using this perspective to analyze ethnographic data collected during a six-month participatory observation at a U.S. data archive, I generate a taxonomy of the forms of alienation that data processing generates, but also the types of resistance that processors develop, across four categories: routine, speed, skill, and meaning. This synthetic approach demonstrates, first, that data processing reproduces typical forms of factory worker’s alienation: processors are asked to work along a strict standardized pipeline, at a fast pace, without acquiring substantive skills or having a meaningful involvement in their work. It reveals, second, how data processors resist the alienating nature of this workflow by developing multiple tactics along the same four categories. Seen through this dual lens, data processors are therefore not only invisible workers, but also factory workers who follow and subvert a workflow organized as an assembly line. I conclude by proposing a four-step framework to better value the social contribution of data workers beyond the archive.
    Keywords: data workers; invisible labor; data archive; knowledge infrastructure; data processing; alienation; Internal OA fund
    JEL: R14 J01
    Date: 2021–04–06
    URL: http://d.repec.org/n?u=RePEc:ehl:lserod:109692&r=
  21. By: Gonnet, Gaston H.; Stewart, John; Lafleur, Joseph; Keith, Stephen; McLellan, Mark; Jiang-Gorsline, David; Snider, Tim
    Abstract: We have developed a new technique of Feature Importance, a topic of machine learning, to analyze the possible causes of the Covid-19 pandemic based on country data. This new approach works well even when there are many more features than countries and is not affected by high correlation of features. It is inspired by the Gram-Schmidt orthogonalization procedure from linear algebra. We study the number of deaths, which is more reliable than the number of cases at the onset of the pandemic, during Apr/May 2020. This is while countries started taking measures, so more light will be shed on the root causes of the pandemic rather than on its handling. The analysis is done against a comprehensive list of roughly 3,200 features. We find that globalization is the main contributing cause, followed by calcium intake, economic factors, environmental factors, preventative measures, and others. This analysis was done for 20 different dates and shows that some factors, like calcium, phase in or out over time. We also compute row explainability, i.e. for every country, how much each feature explains the death rate. Finally we also study a series of conditions, e.g. comorbidities, immunization, etc. which have been proposed to explain the pandemic and place them in their proper context. While there are many caveats to this analysis, we believe it sheds light on the possible causes of the Covid-19 pandemic.
    Date: 2021–07–02
    URL: http://d.repec.org/n?u=RePEc:osf:metaar:4kw2n&r=
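The abstract above says only that the technique is inspired by Gram-Schmidt orthogonalization and does not give the algorithm. As a rough illustration of how orthogonalization can rank features when there are more features than observations and some are highly correlated, here is a minimal sketch of a generic greedy scheme. It is not the authors' procedure. The function names, the toy data, and the greedy selection rule are all assumptions made for this example.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def center(v):
    m = sum(v) / len(v)
    return [a - m for a in v]


def remove_component(v, q):
    """Return v minus its projection onto q (one Gram-Schmidt step)."""
    qq = dot(q, q)
    if qq == 0:
        return list(v)
    coef = dot(v, q) / qq
    return [a - coef * b for a, b in zip(v, q)]


def rank_features(features, target, n_select, tol=1e-12):
    """Greedy ranking: repeatedly pick the feature that explains the most of
    the remaining target variance, then orthogonalize everything against it.

    features: dict mapping a (hypothetical) feature name to a list of values.
    Returns a list of (name, share of target variance explained).
    """
    y = center(target)
    total = dot(y, y)
    if total == 0:
        raise ValueError("target has no variance")
    residuals = {name: center(col) for name, col in features.items()}
    ranking = []
    for _ in range(n_select):
        best_name, best_gain = None, 0.0
        for name, col in residuals.items():
            norm = dot(col, col)
            if norm <= tol:  # nothing left after orthogonalization
                continue
            gain = dot(col, y) ** 2 / norm
            if gain > best_gain:
                best_name, best_gain = name, gain
        if best_name is None:
            break
        q = residuals.pop(best_name)
        ranking.append((best_name, best_gain / total))
        y = remove_component(y, q)
        residuals = {n: remove_component(c, q) for n, c in residuals.items()}
    return ranking


# Toy data (invented): "a_copy" is perfectly correlated with "a".
a = [1, 2, 3, 4, 5, 6]
b = [1, -1, 1, -1, 1, -1]
toy_features = {"a": a, "a_copy": [2 * x for x in a], "b": b}
toy_target = [2 * x + 0.5 * z for x, z in zip(a, b)]
ranking = rank_features(toy_features, toy_target, n_select=3)
print(ranking)
```

In this toy run the duplicated feature receives no separate credit once its twin has been selected, which is the kind of robustness to correlated features that the abstract describes.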
  22. By: Mateusz Kijewski (Faculty of Economic Sciences, University of Warsaw); Szymon Lis (Faculty of Economic Sciences, University of Warsaw); Michał Woźniak (Faculty of Economic Sciences, University of Warsaw); Maciej Wysocki (Faculty of Economic Sciences, University of Warsaw)
    Abstract: Current scientific knowledge allows us to assess the impact of socioeconomic variables on musical preferences. Previous studies relied on psychological experiments and surveys conducted on small groups, or analyzed the influence of only one or two variables at the level of the whole society. Inspired instead by an article in The Economist about February being the gloomiest month in terms of music listened to, we have created a dataset with many different variables that allows us to build more reliable models than previous datasets did. We used the Spotify API to create a monthly dataset with average valence for 26 countries for the period from January 1, 2018, to December 1, 2019. Our study almost fully confirmed the effects of summer, December, and the number of Saturdays in a month, and contradicted the February effect. For the indices of freedom and diversity, the models do not show much consistency. The influence of GDP per capita on valence was confirmed, while the impact of the happiness index was disproved. All models partially confirmed the influence of music genre on valence. Among the weather variables, two models confirmed the significance of temperature. All in all, the effects we analyze can broaden artists' knowledge of when to release new songs or support recommendation engines for streaming services.
    Keywords: valence, spotify, happiness, statistical panel analysis, explainable machine learning
    JEL: C01 C23 I31
    Date: 2021
    URL: http://d.repec.org/n?u=RePEc:war:wpaper:2021-12&r=
  23. By: Scaramozzino, Roberta; Cerchiello, Paola; Aste, Tomaso
    Abstract: The interaction between the flow of sentiment expressed on blogs and media and the dynamics of stock market prices is analyzed through an information-theoretic measure, the transfer entropy, to quantify causality relations. We analyzed daily stock prices and daily social media sentiment for the top 50 companies in the Standard & Poor (S&P) index during the period from November 2018 to November 2020. We also analyzed news mentioning these companies during the same period. We found that there is a causal flux of information that links those companies. The largest fraction of significant causal links is between prices and between sentiments, but there is also significant causal information flowing in both directions, from sentiment to prices and from prices to sentiment. We observe that the strongest causal signal between sentiment and prices is associated with the Tech sector.
    Keywords: information theory; textual analysis; transfer entropy; financial news; causality; time series; ES/K002309/1; EP/P031730/1; H2020-ICT-2018-2 825215
    JEL: F3 G3 C1
    Date: 2021–05–16
    URL: http://d.repec.org/n?u=RePEc:ehl:lserod:110903&r=
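For readers unfamiliar with transfer entropy, the following is a minimal sketch of a plug-in estimate for two discrete series with a history length of one. The abstract does not describe the authors' estimator, so the discretization, the history length, the function name, and the synthetic series are all assumptions made for illustration. The quantity computed is the sum over observed states of p(y_next, y_now, x_now) * log2[ p(y_next | y_now, x_now) / p(y_next | y_now) ].

```python
import random
from collections import Counter
from math import log2


def transfer_entropy(source, target):
    """Plug-in transfer entropy (in bits) from source to target.

    Both inputs are equal-length sequences of discrete symbols;
    only one step of history is used (an assumption of this sketch).
    """
    if len(source) != len(target) or len(target) < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    n = len(target) - 1
    triples = Counter()    # (y_next, y_now, x_now)
    pairs_yx = Counter()   # (y_now, x_now)
    pairs_yy = Counter()   # (y_next, y_now)
    singles_y = Counter()  # y_now
    for t in range(n):
        y_next, y_now, x_now = target[t + 1], target[t], source[t]
        triples[(y_next, y_now, x_now)] += 1
        pairs_yx[(y_now, x_now)] += 1
        pairs_yy[(y_next, y_now)] += 1
        singles_y[y_now] += 1
    te = 0.0
    for (y_next, y_now, x_now), count in triples.items():
        p_joint = count / n
        p_given_both = count / pairs_yx[(y_now, x_now)]
        p_given_own = pairs_yy[(y_next, y_now)] / singles_y[y_now]
        te += p_joint * log2(p_given_both / p_given_own)
    return te


# Synthetic example (invented data): the "price" series copies the
# "sentiment" series with a one-step lag, so information should flow
# from sentiment to price and not the other way round.
rng = random.Random(0)
sentiment = [rng.randint(0, 1) for _ in range(2000)]
price_move = [0] + sentiment[:-1]
te_sentiment_to_price = transfer_entropy(sentiment, price_move)
te_price_to_sentiment = transfer_entropy(price_move, sentiment)
print(te_sentiment_to_price, te_price_to_sentiment)
```

Because the measure is asymmetric, comparing the two directions is what lets a study of this kind talk about the direction of information flow. Judging statistical significance would need an additional step, such as comparison against shuffled series, which is not shown here.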
  24. By: Austin P Wright; Caleb Ziems; Haekyu Park; Jon Saad-Falcon; Duen Horng Chau; Diyi Yang; Maria Tomprou
    Abstract: As job markets worldwide have become more competitive and applicant selection criteria have become more opaque, and different (and sometimes contradictory) information and advice is available for job seekers wishing to progress in their careers, it has never been more difficult to determine which factors in a résumé most effectively help career progression. In this work we present a novel, large scale dataset of over half a million résumés with preliminary analysis to begin to answer empirically which factors help or hurt people wishing to transition to more senior roles as they progress in their career. We find that previous experience forms the most important factor, outweighing other aspects of human capital, and find which language factors in a résumé have significant effects. This lays the groundwork for future inquiry in career trajectories using large scale data analysis and natural language processing techniques.
    Date: 2021–06
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2106.11846&r=
  25. By: Alicia Aguilar (Banco de España); Diego Torres (Banco de España)
    Abstract: The use of quantitative tools to analyse large volumes of qualitative information has been gaining importance. Market participants and, of course, central banks have been part of this trend. The vast majority of qualitative data is unstructured and consists mainly of news, reports and other kinds of text. Transforming it into structured data can improve the availability of information and hence decision making. This article applies sentiment analysis tools to text data in order to quantify the impact of Covid-19 on analysts’ opinions. Using this methodology, it is possible to transform qualitative unstructured data into a quantitative index that can be used to compare reports from different periods and countries. The results show that the pandemic worsens banking sentiment in Europe, which coincides with higher uncertainty in the stock market. The decline in sentiment differs across regions, and opinions diverge more widely.
    Keywords: Sentiment analysis, COVID-19 impact, European banking, analysts’ estimates
    JEL: G21 C81 D8 C43
    Date: 2021–06
    URL: http://d.repec.org/n?u=RePEc:bde:wpaper:2124&r=
  26. By: Luisa Loiacono (Università di Parma e Università di Ferrara); Riccardo Puglisi (Università di Pavia); Leonzio Rizzo (Università di Ferrara e Institut d'Economia Barcelona); Riccardo Secomandi (Università di Ferrara)
    Abstract: The spread of COVID-19 led countries around the world to adopt lockdown measures of varying stringency to restrict the movement of people. However, the effectiveness of these measures on mobility has been markedly different. Employing a difference-in-differences design and a set of robustness checks, we analyse the effectiveness of movement restrictions across different countries. We disentangle the role of regulation (stringency measures) from the role of people’s perception of the spread of COVID-19. We proxy COVID-19 perception with Google Trends data on the term “Covid”. We find that lockdown measures have a larger impact on mobility the more severe people perceive the COVID-19 pandemic to be. This finding is driven by countries with low levels of trust in institutions.
    Keywords: mobility, lockdown measures, COVID-19, stringency index, perception, public health, public policy
    JEL: D7 E7 I18
    Date: 2021–06
    URL: http://d.repec.org/n?u=RePEc:ipu:wpaper:104&r=
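The difference-in-differences idea mentioned in the abstract above can be illustrated with the simplest two-group, two-period case. This is only a sketch: the paper's actual specification (which also involves stringency and perception measures) is not given in the abstract, and the function name and mobility figures below are invented for the example.

```python
from statistics import mean


def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """Two-by-two difference-in-differences estimate.

    The change in the control group is subtracted from the change in the
    treated group, so that a shock common to both groups cancels out.
    """
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Hypothetical mobility index values (100 = normal mobility).
lockdown_pre = [100, 98, 102]
lockdown_post = [60, 62, 58]
no_lockdown_pre = [100, 101, 99]
no_lockdown_post = [90, 89, 91]

effect = diff_in_diff(lockdown_pre, lockdown_post,
                      no_lockdown_pre, no_lockdown_post)
print(effect)
```

The estimate is only interpretable as a causal effect if the two groups would have followed parallel trends without the intervention, which is why studies of this kind typically add robustness checks.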
  27. By: Francesco Carbonero; Jeremy Davies; Ekkehard Ernst; Sayantan Ghosal; Leaza McSorley
    Abstract: Anxiety is a negative emotion experienced today in response to future risk. We model how anxiety impacts on risky investment when expected future returns interact with multiple narratives and strategic uncertainty today. A novel anxiety index is constructed via a sentiment analysis of daily online articles in the Daily Mail, Reuters and Press Association; its plausibility is established by comparing it to the corresponding ONS measure over the Covid-19 pandemic. A SVAR analysis shows that anxiety impacts negatively on stock market volatility, a model prediction. We discuss the welfare implications of lighthouse policies focusing on Brexit and the pandemic.
    Keywords: anxiety, investment, uncertainty, strategy, narratives, sentiment, lighthouse, policy, coronacrisis, Brexit
    JEL: C72 D91 E21 E22 I30
    Date: 2021–06
    URL: http://d.repec.org/n?u=RePEc:gla:glaewp:2021_12&r=
  28. By: Annie Portelance
    Abstract: A central bank’s ability to measure the impact of its communications is nothing less than challenging. This is for several reasons: (i) the general public is a vastly broad audience with varying degrees of knowledge of, interest in and engagement with economics and central banking; (ii) some communications goals—such as building trust—take a significant amount of time; (iii) results from communications efforts are often intangible and difficult to measure; and (iv) many communications outcomes are influenced by broader social factors that are beyond a central bank’s control. The Bank of Canada’s Communications Department has developed a framework to quantify and qualify the Bank’s communications efforts and their results. Using data-based measurement and evaluation, the department can assess the impact of the Bank’s communications activities and gauge the department’s contribution to the Bank’s overall goals. These measurement and evaluation activities have contributed significantly to the Communications Department’s work, informing both strategic and tactical decisions. The use of measurement and evaluation brings a fresh perspective and enriches the practice of strategic communications—in a sense, integrating science into an established art. The Bank’s framework provides a solid foundation upon which measurement and evaluation approaches can stand securely as they evolve.
    Keywords: Central bank research; Credibility; Monetary policy communications
    JEL: D8 D83
    Date: 2021–06
    URL: http://d.repec.org/n?u=RePEc:bca:bocadp:21-9&r=

This nep-big issue is ©2021 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at http://nep.repec.org. For comments please write to the director of NEP, Marco Novarese at <director@nep.repec.org>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.