nep-big 2019-01-28 papers

on Big Data

Issue of 2019‒01‒28
eleven papers chosen by
Tom Coupé
University of Canterbury

Separating the signal from the noise - financial machine learning for Twitter By Schnaubelt, Matthias; Fischer, Thomas G.; Krauss, Christopher
An evaluation of early warning models for systemic banking crises: Does machine learning improve predictions? By Beutel, Johannes; List, Sophia; von Schweinitz, Gregor
Does Scientific Progress Affect Culture? A Digital Text Analysis By Michela Giorcelli; Nicola Lacetera; Astrid Marinoni
Big data analytics and business failures in data-rich environments: An organizing framework. By Amankwah-Amoah, Joseph
Robustness of Support Vector Machines in Algorithmic Trading on Cryptocurrency Market By Maryna Zenkova; Robert Ślepaczuk
Historical Analysis of National Subjective Wellbeing using millions of Digitized Books By Hills, Thomas; Proto, Eugenio; Sgroi, Daniel
Political Competition: How to Measure Party Strategy in Direct Voter Communication using Social Media Data? By Sturm, Silke
Protection and Profit: Empirical Evidence of Governmental and Market-based Forest Policies By Julika Herzberg
Weapon-Carrying among High School Students: A Predictive Model Using Machine Learning By Yiran Fan
Media Sentiment and International Asset Prices By Samuel P. Fraiberger; Do Lee; Damien Puy; Romain Ranciere
Folklore By Stelios Michalopoulos; Melanie Meng Xue

Separating the signal from the noise - financial machine learning for Twitter

By:	Schnaubelt, Matthias; Fischer, Thomas G.; Krauss, Christopher
Abstract:	Most statistical arbitrage strategies in the academic literature solely rely on price time series. By contrast, alternative data sources are of growing importance for professional investors. We contribute to bridging this gap by assessing the price-predictive value of more than nine million tweets on intraday returns of the S&P 500 constituents. For this purpose, we design a machine learning pipeline addressing specific challenges inherent to this task. At first, we engineer domain-specific features along three categories, i.e., directional indicators, relevance indicators and meta features. Next, we leverage a random forest to extract the relationship between these features and subsequent stock returns in a low signal-to-noise setting. For performance evaluation, we run a rigorous event-based backtesting study across all tweets and stocks. We find annualized returns of 6.4 percent and a Sharpe ratio of 2.2 after transaction costs. Finally, we illuminate the machine learning black box and unveil sources of profitability: First, results are both driven and limited by the temporal clustering of tweets, i.e., the majority of profits stem from tweets clustered closely together in time, corresponding to high-event situations. Second, the importance of included features follows an economic rationale, e.g., tweets with positive sentiment tend to yield positive returns and vice versa. Third, we find that stocks of medium market capitalization and from the consumer and technology sectors contribute most to our results, which we interpret as a trade-off between tweet coverage and tweet relevance.
Keywords:	finance,statistical arbitrage,machine learning,random forests,trading strategy backtesting,social media
Date:	2018
URL:	http://d.repec.org/n?u=RePEc:zbw:iwqwdp:142018&r=all

An evaluation of early warning models for systemic banking crises: Does machine learning improve predictions?

By:	Beutel, Johannes; List, Sophia; von Schweinitz, Gregor
Abstract:	This paper compares the out-of-sample predictive performance of different early warning models for systemic banking crises using a sample of advanced economies covering the past 45 years. We compare a benchmark logit approach to several machine learning approaches recently proposed in the literature. We find that while machine learning methods often attain a very high in-sample fit, they are outperformed by the logit approach in recursive out-of-sample evaluations. This result is robust to the choice of performance measure, crisis definition, preference parameter, and sample length, as well as to using different sets of variables and data transformations. Thus, our paper suggests that further enhancements to machine learning early warning models are needed before they are able to offer a substantial value-added for predicting systemic banking crises. Conventional logit models appear to use the available information already fairly efficiently, and would for instance have been able to predict the 2007/2008 financial crisis out-of-sample for many countries. In line with economic intuition, these models identify credit expansions, asset price booms and external imbalances as key predictors of systemic banking crises.
Keywords:	early warning system,logit,machine learning,systemic banking crises
JEL:	C35 C53 G01
Date:	2019
URL:	http://d.repec.org/n?u=RePEc:zbw:iwhdps:22019&r=all
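
The recursive out-of-sample evaluation mentioned in the abstract can be pictured as an expanding window: a model is re-estimated on all periods up to t and then scored on period t. The sketch below only generates the index splits; the function name and parameters are hypothetical and are not taken from the paper.

```python
from typing import Iterator, List, Tuple


def recursive_splits(n_periods: int, first_test: int) -> Iterator[Tuple[List[int], int]]:
    """Yield (training indices, test index) pairs for an expanding window.

    Hypothetical helper: every period before t is used for estimation,
    and period t alone is held out for evaluation.
    """
    for t in range(first_test, n_periods):
        yield list(range(t)), t


# Six periods, with the first out-of-sample prediction made for period 3.
splits = list(recursive_splits(n_periods=6, first_test=3))
for train, test in splits:
    print(f"estimate on {train}, predict period {test}")
```

Because the training set never includes the test period or anything after it, a high in-sample fit cannot leak into the out-of-sample score.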

Does Scientific Progress Affect Culture? A Digital Text Analysis

By:	Michela Giorcelli; Nicola Lacetera; Astrid Marinoni
Abstract:	We study the interplay between scientific progress and culture through text analysis on a corpus of about eight million books, with the use of techniques and algorithms from machine learning. We focus on a specific scientific breakthrough, the theory of evolution through natural selection by Charles Darwin, and examine the diffusion of certain key concepts that characterized this theory in the broader cultural discourse and social imaginary. We find that some concepts in Darwin’s theory, such as Evolution, Survival, Natural Selection and Competition, diffused in the cultural discourse immediately after the publication of On the Origin of Species. Other concepts such as Selection and Adaptation were already present in the cultural dialogue. Moreover, we document semantic changes for most of these concepts over time. Our findings thus show a complex relation between two key factors of long-term economic growth – science and culture. Considering the evolution of these two factors jointly can offer new insights to the study of the determinants of economic development, and machine learning is a promising tool to explore these relationships.
JEL:	N00 O30 Z1
Date:	2019–01
URL:	http://d.repec.org/n?u=RePEc:nbr:nberwo:25429&r=all

Big data analytics and business failures in data-rich environments: An organizing framework.

By:	Amankwah-Amoah, Joseph
Abstract:	Despite the burgeoning scholarly work on big data and big data analytical capabilities, there remains limited research on how different access to big data and different big data analytic capabilities possessed by firms can generate diverse conditions leading to business failure. To fill this gap in the existing literature, an integrated framework was developed that entailed two approaches to big data as an asset (i.e. threshold resource and distinctive resource) and two types of competences in big data analytics (i.e. threshold competence and distinctive/core competence). The analysis provides insights into how ordinary big data analytic capability and mere possession of big data are more likely to create conditions for business failure. The study extends the existing streams of research by shedding light on decisions and processes in facilitating or hampering firms’ ability to harness big data to mitigate the cause of business failures. The analysis led to the categorisation of a number of fruitful avenues for research on data-driven approaches to business failure.
Keywords:	big data analytics; technology; innovation management; big data; business failure
JEL:	L1 L2
Date:	2019–01–01
URL:	http://d.repec.org/n?u=RePEc:pra:mprapa:91264&r=all

Robustness of Support Vector Machines in Algorithmic Trading on Cryptocurrency Market

By:	Maryna Zenkova (Quantitative Finance Research Group, Faculty of Economic Sciences, University of Warsaw); Robert Ślepaczuk (Quantitative Finance Research Group, Faculty of Economic Sciences, University of Warsaw)
Abstract:	This study investigates the profitability of an algorithmic trading strategy based on training an SVM model to identify cryptocurrencies with high or low predicted returns. A tail set is defined to be a group of coins whose volatility-adjusted returns are in the highest or lowest quantile. Each cryptocurrency is represented by a set of six technical features. The SVM is trained on historical tail sets and tested on the current data. The classifier is chosen to be a nonlinear support vector machine. The portfolio is formed by ranking coins using the SVM output. The highest ranked coins are used for long positions to be included in the portfolio for one reallocation period. The following metrics were used to estimate the portfolio profitability: %ARC (the annualized rate of change), %ASD (the annualized standard deviation of daily returns), MDD (the maximum drawdown coefficient), IR1, IR2 (the information ratio coefficients). The performance of the SVM portfolio is compared to the performance of the four benchmark strategies based on the values of the information ratio coefficient IR1, which quantifies the risk-weighted gain. The question of how sensitive the portfolio performance is to the parameters set in the SVM model is also addressed in this study.
Keywords:	machine learning, support vector machines, investment algorithm, algorithmic trading, strategy, optimization, cross-validation, overfitting, cryptocurrency market, technical analysis, meta parameters
JEL:	C4 C45 C61 C15 G14 G17
Date:	2019
URL:	http://d.repec.org/n?u=RePEc:war:wpaper:2019-02&r=all
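
The performance metrics listed in the abstract can be illustrated on a toy equity curve. The sketch below is not the authors' code: it assumes a 365-day trading year (crypto markets trade daily), a sample standard deviation, and IR1 defined as ARC divided by ASD, which matches the abstract's description of a risk-weighted gain but is not spelled out there. IR2 is omitted because the abstract does not define it.

```python
import math
from typing import List

DAYS_PER_YEAR = 365  # assumption: cryptocurrencies trade every calendar day


def daily_returns(equity: List[float]) -> List[float]:
    """Simple day-over-day returns of an equity curve."""
    return [equity[i] / equity[i - 1] - 1.0 for i in range(1, len(equity))]


def arc(equity: List[float]) -> float:
    """Annualized rate of change (as a fraction, not a percentage)."""
    n_days = len(equity) - 1
    return (equity[-1] / equity[0]) ** (DAYS_PER_YEAR / n_days) - 1.0


def asd(equity: List[float]) -> float:
    """Annualized standard deviation of daily returns."""
    r = daily_returns(equity)
    mean = sum(r) / len(r)
    variance = sum((x - mean) ** 2 for x in r) / (len(r) - 1)
    return math.sqrt(variance) * math.sqrt(DAYS_PER_YEAR)


def mdd(equity: List[float]) -> float:
    """Maximum drawdown: largest peak-to-trough loss as a fraction of the peak."""
    peak = equity[0]
    worst = 0.0
    for value in equity:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst


def ir1(equity: List[float]) -> float:
    """Assumed definition: ARC divided by ASD."""
    return arc(equity) / asd(equity)


toy_equity = [100.0, 110.0, 99.0, 120.0]  # made-up portfolio values
print(f"MDD = {mdd(toy_equity):.2%}")
print(f"IR1 = {ir1(toy_equity):.2f}")
```

Comparing strategies on IR1 rather than on ARC alone penalizes a portfolio that earns its return through higher volatility.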

Historical Analysis of National Subjective Wellbeing using millions of Digitized Books

By:	Hills, Thomas (Department of Psychology, University of Warwick); Proto, Eugenio (Department of Economics, University of Warwick, CAGE and IZA); Sgroi, Daniel (Department of Economics, University of Warwick, CAGE and Nuffield College, University of Oxford)
Abstract:	We develop a new way to measure national subjective well-being across the very long run where traditional survey data on well-being is not available. Our method is based on quantitative analysis of digitized text from millions of books published over the past 200 years, long before the widespread availability of consistent survey data. The method uses psychological valence norms for thousands of words in different languages to compute the relative proportion of positive and negative language for four different nations (the USA, UK, Germany and Italy). We validate our measure against existing survey data from the 1970s onwards (when such data became available) showing that our measure is highly correlated with surveyed life satisfaction. We also validate our measure against historical trends in longevity and GDP (showing a positive relationship) and conflict (showing a negative relationship). Our measure allows a first look at changes in subjective well-being over the past two centuries, for instance highlighting the dramatic fall in well-being during the two World Wars and rise in relation to longevity.
Keywords:	historical subjective well-being ; language ; big data ; GDP ; conflict
JEL:	N3 N4 O1 D6
Date:	2019
URL:	http://d.repec.org/n?u=RePEc:wrk:warwec:1186&r=all
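
A valence-norm text measure of the kind described in the abstract can be sketched in a few lines. The lexicon below is made up for illustration, and averaging the ratings of matched words is a simplification; the abstract does not give the exact aggregation the authors use.

```python
import re
from typing import Dict, Optional

# Hypothetical valence norms on a 1 (negative) to 9 (positive) scale.
TOY_NORMS: Dict[str, float] = {"joy": 8.2, "peace": 7.9, "war": 1.8, "hunger": 2.1}


def mean_valence(text: str, norms: Dict[str, float]) -> Optional[float]:
    """Average valence of the words in `text` that appear in `norms`.

    Returns None when no word in the text has a rating.
    """
    words = re.findall(r"[a-z]+", text.lower())
    rated = [norms[w] for w in words if w in norms]
    if not rated:
        return None
    return sum(rated) / len(rated)


score = mean_valence("Joy and peace after war", TOY_NORMS)
print(score)
```

Applied to all text published in a given country and year, a score of this kind yields a time series that can then be compared with survey-based life satisfaction.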

Political Competition: How to Measure Party Strategy in Direct Voter Communication using Social Media Data?

By:	Sturm, Silke
Abstract:	Political competition, party strategy and communication in the era of social media are growing issues. Due to the increasing social media presence of parties and voters alike, direct communication is more important for party competition. This paper aims to improve the methodological approach used to analyze political competition and communication. The dataset includes over 30,000 Facebook status messages posted by seven German parties from January 2014 until February 2018. Topic modeling, which is commonly used in other fields, allows for evaluating party communication on a daily basis. The results show the high accuracy of calculating party-relevant issues. To determine the tone of the debate, a sentiment analysis was conducted. The prevalence of topics and sentiments over time allows for precise monitoring of the political debate.
Keywords:	Political competition,Party strategy,Decision making,Social media,Topic models,Sentiment analysis
JEL:	C81 D72 D83 D91
Date:	2019
URL:	http://d.repec.org/n?u=RePEc:zbw:uhhhdp:1&r=all

Protection and Profit: Empirical Evidence of Governmental and Market-based Forest Policies

By:	Julika Herzberg (Aachen University)
Abstract:	In this paper, I study the effectiveness of privately managed FSC-certified forests and public sustainability reserves distributed over the entire Brazilian Amazon from 2002 to 2015. The paper uses high-resolution data on forest cover derived from satellite images and organized in a grid of 1 km² cells. Using a difference-in-differences estimator in a regression discontinuity environment, I find that deforestation in FSC forests increases by 8,057 ha per year after certification. Public sustainability zones' impact on deforestation is also positive but declines over time. The effectiveness of both types of zones improves if they are located closer to (export) markets or existing infrastructure.
Keywords:	deforestation, commodity prices, sustainable forest management
JEL:	J43 O13 O14 Q15 Q17
Date:	2019
URL:	http://d.repec.org/n?u=RePEc:mar:magkse:201901&r=all
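
The core of a difference-in-differences estimate is a comparison of before/after changes between treated and untreated units. The sketch below shows only the plain two-group, two-period version on made-up numbers; it leaves out the regression discontinuity design and any controls used in the paper, and all names and values are hypothetical.

```python
from typing import List


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


def diff_in_diff(
    treated_pre: List[float],
    treated_post: List[float],
    control_pre: List[float],
    control_post: List[float],
) -> float:
    """Change in the treated group minus change in the control group."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Made-up deforestation per grid cell (ha), before and after certification.
estimate = diff_in_diff(
    treated_pre=[2.0, 4.0],
    treated_post=[5.0, 7.0],
    control_pre=[1.0, 3.0],
    control_post=[2.0, 4.0],
)
print(estimate)  # treated cells rose by 3, control cells by 1
```

Subtracting the control group's change removes trends common to both groups, under the assumption that the two groups would otherwise have moved in parallel.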

Weapon-Carrying among High School Students: A Predictive Model Using Machine Learning

By:	Yiran Fan (The Linsly School, Wheeling, WV, USA)
Abstract:	This study aims to 1) identify the predictors of weapon-carrying on school property and 2) build a predictive model that parents, educators, and pediatricians can use for early intervention. Data from the 2017 Youth Risk Behavior Surveillance System (YRBSS) were used. A logistic regression model is used to calculate the predicted risk. Logistic regression belongs to a category of statistical models called generalized linear models, and it allows one to predict a discrete outcome from a set of variables that may be continuous, discrete, dichotomous, or a combination of these. Typically, the dependent variable is dichotomous and the independent variables are either categorical or continuous. The data were analyzed in R. The outcome variable is weapon-carrying, based on Q13 (During the past 30 days, on how many days did you carry a weapon such as a gun, knife, or club on school property?). The results identified several important predictors of carrying a weapon on school property, such as gender, alcohol use, and smoking age. This provides important information for educators and parents for early intervention and for alleviating the negative effects of weapon-carrying among teenagers.
Keywords:	weapon, school, educators
Date:	2018–11
URL:	http://d.repec.org/n?u=RePEc:smo:jpaper:051yf&r=all
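
The predicted risk from a fitted logistic regression is the logistic function applied to a linear combination of the predictors. The sketch below illustrates that step in Python (the study itself used R). The coefficients and variable names are invented for illustration and are not estimates from the paper.

```python
import math
from typing import Dict


def predicted_risk(intercept: float, coefs: Dict[str, float], x: Dict[str, float]) -> float:
    """Logistic model: P(y = 1 | x) = 1 / (1 + exp(-(intercept + sum(b_k * x_k))))."""
    z = intercept + sum(coefs[name] * x[name] for name in coefs)
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients; NOT taken from the study.
toy_intercept = -3.0
toy_coefs = {"male": 0.9, "alcohol_use": 0.7}

low = predicted_risk(toy_intercept, toy_coefs, {"male": 0.0, "alcohol_use": 0.0})
high = predicted_risk(toy_intercept, toy_coefs, {"male": 1.0, "alcohol_use": 1.0})
print(f"baseline risk {low:.3f}, with both risk factors {high:.3f}")
```

A positive coefficient raises the predicted probability, and the output always stays strictly between 0 and 1, which is why the model suits a dichotomous outcome.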

Media Sentiment and International Asset Prices

By:	Samuel P. Fraiberger; Do Lee; Damien Puy; Romain Ranciere
Abstract:	We assess the impact of media sentiment on international equity prices using more than 4.5 million Reuters articles published across the globe between 1991 and 2015. News sentiment robustly predicts daily returns in both advanced and emerging markets, even after controlling for known determinants of stock prices. But not all news-sentiment is alike. A local (country-specific) increase in news optimism (pessimism) predicts a small and transitory increase (decrease) in local returns. By contrast, changes in global news sentiment have a larger impact on equity returns around the world, which does not reverse in the short run. We also find evidence that news sentiment affects mainly foreign – rather than local – investors: although local news optimism attracts international equity flows for a few days, global news optimism generates a permanent foreign equity inflow. Our results confirm the value of media content in capturing investor sentiment.
Keywords:	International financial markets;Capital flows;Asset Pricing, Behavioral Finance, Investor Sentiment, News Media, Natural Language Processing
Date:	2018–12–10
URL:	http://d.repec.org/n?u=RePEc:imf:imfwpa:18/274&r=all

Folklore

By:	Stelios Michalopoulos; Melanie Meng Xue
Abstract:	Folklore is the collection of traditional beliefs, customs, and stories of a community, passed through the generations by word of mouth. This vast expressive body, studied by the corresponding discipline of folklore, has evaded the attention of economists. In this study we do four things that reveal the tremendous potential of this corpus for understanding comparative development and culture. First, we introduce and describe a unique catalogue of folklore that codes the presence of thousands of motifs for roughly 1,000 pre-industrial societies. Second, we use a dictionary-based approach to elicit group-specific measures of various traits related to the natural environment, institutional framework, and mode of subsistence. We establish that these proxies are in accordance with the ethnographic record, and illustrate how to use a group’s oral tradition to quantify non-extant characteristics of pre-industrial societies. Third, we use folklore to uncover the historical cultural values of a group. Doing so allows us to test various influential conjectures among social scientists including the original affluent society, the culture of honor among pastoralists, the role of family in extended kinship systems and the intensity of trade and rule-following norms in politically centralized groups. Finally, we explore how cultural norms inferred via text analysis of oral traditions predict contemporary attitudes and beliefs.
JEL:	N00 Z1 Z13
Date:	2019–01
URL:	http://d.repec.org/n?u=RePEc:nbr:nberwo:25430&r=all

This nep-big issue is ©2019 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.

General information on the NEP project can be found at http://nep.repec.org. For comments please write to the director of NEP, Marco Novarese at <director@nep.repec.org>. Put “NEP” in the subject, otherwise your mail may be rejected.

NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.