nep-big New Economics Papers
on Big Data
Issue of 2018‒12‒17
thirteen papers chosen by
Tom Coupé
University of Canterbury

  1. Robust Classification of Financial Risk By Suproteem K. Sarkar; Kojin Oshiba; Daniel Giebisch; Yaron Singer
  2. Google econometrics: nowcasting euro area car sales and big data quality requirements By Nymand-Andersen, Per; Pantelidis, Emmanouil
  3. BDLOB: Bayesian Deep Convolutional Neural Networks for Limit Order Books By Zihao Zhang; Stefan Zohren; Stephen Roberts
  4. Idiosyncrasies and challenges of data driven learning in electronic trading By Vangelis Bacoyannis; Vacslav Glukhov; Tom Jin; Jonathan Kochems; Doo Re Song
  5. Machine Learning for Yield Curve Feature Extraction: Application to Illiquid Corporate Bonds By Greg Kirczenow; Masoud Hashemi; Ali Fathi; Matt Davison
  6. Churn and loyalty behaviour of Turkish digital natives By Guven, Faruk
  7. Regional Output Growth in the United Kingdom: More Timely and Higher Frequency Estimates, 1970-2017 By Gary Koop; Stuart McIntyre; James Mitchell; Aubrey Poon
  8. Is data the new oil? Diminishing returns to scale By Arnold, René; Marcus, J. Scott; Petropoulos, Georgios; Schneider, Anna
  9. SENTIMENT ANALYSIS AND CLASSIFICATION OF TWEETS BASED ON IDEOLOGIES By Rajani; Pankaj Sambyal; Shalini Sharma
  10. Predicting future stock market structure by combining social and financial network information By Thársis T. P. Souza; Tomaso Aste
  11. Italian happy parents in Twitter By Letizia Mencarini; Delia Irazú Hernández-Farías; Mirko Lai; Viviana Patti; Emilio Sulis; Daniele Vignoli
  12. Stigma or Cushion? IMF Programs and Sovereign Creditworthiness By Kai Gehring; Valentin F. Lang
  13. Squeezing More Out of Your Data: Business Record Linkage with Python By John Cuffe; Nathan Goldschlag

  1. By: Suproteem K. Sarkar; Kojin Oshiba; Daniel Giebisch; Yaron Singer
    Abstract: Algorithms are increasingly common components of high-impact decision-making, and a growing body of literature on adversarial examples in laboratory settings indicates that standard machine learning models are not robust. This suggests that real-world systems are also susceptible to manipulation or misclassification, which especially poses a challenge to machine learning models used in financial services. We use the loan grade classification problem to explore how machine learning models are sensitive to small changes in user-reported data, using adversarial attacks documented in the literature and an original, domain-specific attack. Our work shows that a robust optimization algorithm can build models for financial services that are resistant to misclassification on perturbations. To the best of our knowledge, this is the first study of adversarial attacks and defenses for deep learning in financial services.
    Date: 2018–11
    URL: http://d.repec.org/n?u=RePEc:arx:papers:1811.11079&r=big
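    A minimal sketch of the kind of sensitivity the paper studies: a fast-gradient-style perturbation of a toy logistic loan-grade classifier. The features, weights and attack below are hypothetical stand-ins, not the authors' domain-specific attack or their robust defense.

      import numpy as np

      # Hypothetical loan features (e.g. income, debt ratio, tenure) and a
      # fitted logistic model (w, b) standing in for a loan-grade classifier.
      w = np.array([1.2, -2.0, 0.5])
      b = -0.3
      x = np.array([0.8, 0.4, 1.1])          # one applicant's scaled features

      def predict_proba(x):
          return 1.0 / (1.0 + np.exp(-(w @ x + b)))

      # FGSM-style attack: nudge each user-reported feature by eps in the
      # direction that increases the log-loss, i.e. the sign of its gradient.
      def fgsm(x, y, eps=0.05):
          grad = (predict_proba(x) - y) * w  # d(log-loss)/dx for this model
          return x + eps * np.sign(grad)

      y = 1                                  # true label: creditworthy
      x_adv = fgsm(x, y)
      print(f"clean p={predict_proba(x):.3f}, attacked p={predict_proba(x_adv):.3f}")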
  2. By: Nymand-Andersen, Per; Pantelidis, Emmanouil
    Abstract: “Big data” is becoming an increasingly important aspect of our daily lives as the digital sources of information and intelligence that it encompasses become more structured and more publicly available. These sources may enable the generation of new datasets providing high-frequency and timely insights into unconscious digital behaviour and the consequent actions of economic agents, which may, in turn, assist in the generation of early indicators of economic and financial trends and activities. This paper examines the usefulness of Google search data in nowcasting euro area car sales, as a leading macroeconomic indicator, and considers the quality requirements for using these new data sources as a toolkit for sound decision and policy making. The paper finds that, while Google data may have predictive capabilities for nowcasting euro area car sales, further quality improvements in the data source are needed in order to move beyond experimental statistics. If these quality requirements can be met, the resulting advances in theory and knowledge around interpreting big data can be expected to significantly re-shape how we think about and explain both behaviour and complex socio-economic phenomena.
    Keywords: big data, google internet search, modelling, nowcasting, quality, statistics, vector auto regression
    JEL: C53 C82 E58 E71
    Date: 2018–11
    URL: http://d.repec.org/n?u=RePEc:ecb:ecbsps:201830&r=big
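    A bridge-equation sketch of the nowcasting idea: regress sales growth on its own lag and a contemporaneous Google search index, which is available well before official statistics. Both series are simulated here; the paper itself works with vector autoregressions on actual Google and ECB data.

      import numpy as np

      rng = np.random.default_rng(1)

      # Simulated monthly series: y = car sales growth, g = a Google search
      # index for car-related queries (both standardised).
      T = 120
      g = rng.normal(size=T)
      y = 0.3 * np.roll(g, 1) + 0.2 * g + rng.normal(scale=0.5, size=T)

      # Fit the bridge equation y_t = a + b*y_{t-1} + c*g_t by OLS.
      X = np.column_stack([np.ones(T - 1), y[:-1], g[1:]])
      beta, *_ = np.linalg.lstsq(X, y[1:], rcond=None)

      # Nowcast the latest month from the previous month's sales and the
      # just-released search index, before official sales are published.
      nowcast = beta @ np.array([1.0, y[-2], g[-1]])
      print("coefficients:", beta.round(3), " nowcast:", nowcast.round(3),
            " actual:", y[-1].round(3))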
  3. By: Zihao Zhang; Stefan Zohren; Stephen Roberts
    Abstract: We showcase how dropout variational inference can be applied to a large-scale deep learning model that predicts price movements from limit order books (LOBs), the canonical data source representing trading and pricing movements. We demonstrate that uncertainty information derived from posterior predictive distributions can be utilised for position sizing, avoiding unnecessary trades and improving profits. Further, we test our models by using millions of observations across several instruments and markets from the London Stock Exchange. Our results suggest that those Bayesian techniques not only deliver uncertainty information that can be used for trading but also improve predictive performance as stochastic regularisers. To the best of our knowledge, we are the first to apply Bayesian networks to LOBs.
    Date: 2018–11
    URL: http://d.repec.org/n?u=RePEc:arx:papers:1811.10041&r=big
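    The core trick, Monte Carlo dropout, is easy to sketch: keep dropout active at prediction time and treat the spread of stochastic forward passes as uncertainty. The tiny network and the position-sizing rule below are illustrative stand-ins for the paper's deep convolutional LOB architecture.

      import torch
      import torch.nn as nn

      torch.manual_seed(0)

      # Toy stand-in for a deep LOB model; what matters is the Dropout layer.
      model = nn.Sequential(
          nn.Linear(40, 64), nn.ReLU(), nn.Dropout(p=0.3),
          nn.Linear(64, 3),                   # down / flat / up logits
      )

      def mc_dropout_predict(x, n_samples=100):
          """Average softmax outputs over stochastic passes with dropout ON."""
          model.train()                       # keeps dropout active at test time
          with torch.no_grad():
              probs = torch.stack([torch.softmax(model(x), dim=-1)
                                   for _ in range(n_samples)])
          return probs.mean(0), probs.std(0)  # predictive mean and spread

      x = torch.randn(1, 40)                  # one hypothetical LOB feature vector
      mean, spread = mc_dropout_predict(x)

      # Uncertainty-aware sizing: scale the position down when the
      # predictive spread is large, skipping low-conviction trades.
      signal = float(mean[0, 2] - mean[0, 0])        # p(up) - p(down)
      position = signal * max(0.0, 1.0 - float(spread.max()))
      print(mean, spread, position)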
  4. By: Vangelis Bacoyannis; Vacslav Glukhov; Tom Jin; Jonathan Kochems; Doo Re Song
    Abstract: We outline the idiosyncrasies of neural information processing and machine learning in quantitative finance. We also present some of the approaches we take towards solving the fundamental challenges we face.
    Date: 2018–11
    URL: http://d.repec.org/n?u=RePEc:arx:papers:1811.09549&r=big
  5. By: Greg Kirczenow; Masoud Hashemi; Ali Fathi; Matt Davison
    Abstract: This paper studies an application of machine learning in extracting features from the historical market implied corporate bond yields. We consider an example of a hypothetical illiquid fixed income market. After choosing a surrogate liquid market, we apply the Denoising Autoencoder (DAE) algorithm to learn the features of the missing yield parameters from the historical data of the instruments traded in the chosen liquid market. The DAE algorithm is then challenged by two "point-in-time" inpainting algorithms taken from the image processing and computer vision domain. It is observed that, when tested on unobserved rate surfaces, the DAE algorithm exhibits superior performance thanks to the features it has learned from the historical shapes of yield curves.
    Date: 2018–11
    URL: http://d.repec.org/n?u=RePEc:arx:papers:1812.01102&r=big
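    A compact denoising-autoencoder sketch of the inpainting step: corrupt historical curves by masking random tenors, train to reconstruct the full curve, then fill in an "illiquid" curve from a few observed quotes. The two-factor curve history is simulated purely for illustration.

      import torch
      import torch.nn as nn

      torch.manual_seed(0)

      # Simulated history of yield curves (500 days x 10 tenors) generated
      # from a level/slope factor model.
      days, tenors = 500, 10
      t = torch.linspace(0.5, 10.0, tenors)
      curves = (torch.randn(days, 1) * 0.5 + 3.0
                + torch.randn(days, 1) * 0.3 * torch.log1p(t))

      # Denoising autoencoder: reconstruct full curves from masked inputs.
      dae = nn.Sequential(nn.Linear(tenors, 6), nn.ReLU(), nn.Linear(6, tenors))
      opt = torch.optim.Adam(dae.parameters(), lr=1e-2)
      for _ in range(300):
          mask = (torch.rand_like(curves) > 0.3).float()  # drop ~30% of quotes
          loss = nn.functional.mse_loss(dae(curves * mask), curves)
          opt.zero_grad(); loss.backward(); opt.step()

      # "Inpaint" an illiquid curve where only three tenors are quoted.
      obs = curves[0] * torch.tensor([1., 0, 0, 1, 0, 0, 0, 1, 0, 0])
      print(torch.round(dae(obs).detach(), decimals=2))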
  6. By: Guven, Faruk
    Abstract: The mobile industry drives innovation and economic growth all over the world, thanks to the application industry, content providers, and mobile handset producers and operators. With a very young and dynamic population and a remarkable 96 percent mobile penetration rate, Turkey is no stranger to this trend. Yet this very segment – commonly referred to as digital natives – also poses a dilemma for telecom operators: a high rate of attrition, or churn. This paper reports on an empirical examination of the churn and loyalty characteristics of digital natives in the Turkish context. We employ a large sample of young consumers and analyse their churn and loyalty likelihoods. Overall, we find convincing evidence that, by adopting a consumer-centric approach and knowing more about their individual customers, telecom operators can drive loyalty and prevent churn. The rise of big data and sophisticated analysis based on behavioral patterns implies that operators now have the tools they need to predict consumer behavior better than ever before.
    Keywords: Digital Natives,Churn,Loyalty,Consumer Behavior,Turkey,Mobile Services
    Date: 2018
    URL: http://d.repec.org/n?u=RePEc:zbw:itse18:184943&r=big
  7. By: Gary Koop; Stuart McIntyre; James Mitchell; Aubrey Poon
    Abstract: Output growth estimates for the regions of the UK are currently published at the annual frequency only and are released with a long delay. Regional economists and policymakers would benefit from having higher frequency and more timely estimates. In this paper we develop a mixed frequency Vector Autoregressive (MF-VAR) model and use it to produce estimates of quarterly regional output growth. Temporal and cross-sectional restrictions are imposed in the model to ensure that our quarterly regional estimates are consistent with the annual regional observations and the observed quarterly UK totals. We use a machine learning method based on the hierarchical Dirichlet-Laplace prior to ensure optimal shrinkage and parsimony in our over-parameterised MF-VAR. Thus, this paper presents a new, regional quarterly database of nominal and real Gross Value Added dating back to 1970. We describe how we update and evaluate these estimates on an ongoing, quarterly basis to publish online (at www.escoe.ac.uk/regionalnowcasting) more timely estimates of regional economic growth. We illustrate how the new quarterly data can be used to contribute to our historical understanding of business cycle dynamics and connectedness between regions.
    Keywords: Regional data, Mixed frequency, Nowcasting, Bayesian methods, Real-time data, Vector autoregressions
    JEL: C32 C53 E37
    Date: 2018–11
    URL: http://d.repec.org/n?u=RePEc:nsr:escoed:escoe-dp-2018-14&r=big
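    The temporal restriction at the heart of the exercise can be shown in a few lines: choose quarterly values that are as smooth as possible while summing exactly to the annual observations (a Denton-style benchmarking problem). The annual figures below are invented; the paper embeds this constraint inside a full Bayesian MF-VAR rather than solving it in isolation.

      import numpy as np

      annual = np.array([100.0, 104.0, 110.0])      # hypothetical annual GVA
      n_years, q = len(annual), 4
      n = n_years * q

      # Minimise the sum of squared quarter-on-quarter changes subject to
      # C @ x = annual, where C aggregates quarters into years.
      D = np.diff(np.eye(n), axis=0)                # first-difference operator
      C = np.kron(np.eye(n_years), np.ones((1, q))) # annual aggregation matrix

      # Solve the equality-constrained least squares via its KKT system.
      K = np.block([[D.T @ D, C.T], [C, np.zeros((n_years, n_years))]])
      rhs = np.concatenate([np.zeros(n), annual])
      quarters = np.linalg.solve(K, rhs)[:n]
      print(quarters.reshape(n_years, q).round(2))  # rows sum to the annual data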
  8. By: Arnold, René; Marcus, J. Scott; Petropoulos, Georgios; Schneider, Anna
    Abstract: A key advantage of online advertising over offline is that online advertising can, with sufficient data, be far more accurately targeted than traditional advertising. But how much data is enough? The empirical literature tends to suggest that there are indeed economies of scale in using data for market targeting, but that these benefits are subject to diminishing returns in a static perspective. Is there a plateau, and is it perhaps very large? It is clear that a certain amount of data is necessary to identify meaningful consumer segments and to offer targeted advertising space as part of an advertising campaign; however, a simple correlation between the volume of data gathered by an advertiser and the return on investment of an advertising campaign neglects the complexity of advertising effectiveness. We provide a general assessment of key elements of the literature on economies of scale in the use of data for online advertising, and then seek to link these to the general literature on market targeting in order to provide insights as to the factors that limit effectiveness in using big data for market targeting.
    Date: 2018
    URL: http://d.repec.org/n?u=RePEc:zbw:itse18:184927&r=big
  9. By: Rajani (Kalindi College); Pankaj Sambyal (Kalindi College); Shalini Sharma (Kalindi College)
    Abstract: In this paper, we use data mining and sentiment analysis techniques to classify tweets according to different ideologies, i.e. secularism, liberalism, communalism, socialism and casteism. To evaluate our model, we used tweets from three sources: generic Indian tweets, the tweets of a specific user profile, and tweets carrying particular hashtags. The tweets are fetched using the Twitter API. The fetched data is preprocessed by analyzing the structure of tweets to extract points of interest such as the most retweeted tweets, the most favorited tweets, and trending hashtags. The tweets are then tokenized, and POS (part-of-speech) tagging is applied to the tokens to identify the nouns, verbs, adverbs and adjectives relevant for the analysis. We apply various relevance models to the data to determine the sentiment of each tweet and the ideological stance of the user. The results are shown using a spider graph. The model achieved 73% accuracy.
    Keywords: Multiclass, data mining, twitter, hashtag
    Date: 2018–10
    URL: http://d.repec.org/n?u=RePEc:sek:iacpro:7010170&r=big
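    The preprocessing pipeline described here (tokenize, POS-tag, keep content words, score) is easy to reproduce in outline with NLTK. VADER below is a generic stand-in for the authors' relevance and ideology models, and the tweet is invented.

      import nltk
      from nltk.sentiment.vader import SentimentIntensityAnalyzer
      # One-time downloads: punkt, averaged_perceptron_tagger, vader_lexicon

      tweet = "The new policy is a shameless attack on secular values #politics"

      # Tokenize and POS-tag, keeping nouns, verbs, adjectives and adverbs,
      # the categories the paper treats as relevant for the analysis.
      tagged = nltk.pos_tag(nltk.word_tokenize(tweet))
      content = [w for w, tag in tagged if tag[:2] in ("NN", "VB", "JJ", "RB")]

      # Score the filtered tweet with a general-purpose sentiment lexicon.
      scores = SentimentIntensityAnalyzer().polarity_scores(" ".join(content))
      print(content, scores)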
  10. By: Thársis T. P. Souza; Tomaso Aste
    Abstract: We demonstrate that future market correlation structure can be predicted with high out-of-sample accuracy using a multiplex network approach that combines information from social media and financial data. Market structure is measured by quantifying the co-movement of asset price returns, while social structure is measured as the co-movement of social media opinion on those same assets. Predictions are obtained with a simple model that uses link persistence and link formation by triadic closure across both the financial and social media layers. Results demonstrate that the proposed model can predict future market structure with up to a 40% out-of-sample performance improvement compared to a benchmark model that assumes a time-invariant financial correlation structure. Social media information leads to improved models in all settings tested, particularly for long-term prediction of financial market structure. Surprisingly, financial market structure exhibited higher predictability than social opinion structure.
    Date: 2018–11
    URL: http://d.repec.org/n?u=RePEc:arx:papers:1812.01103&r=big
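    The two prediction mechanisms, link persistence and triadic closure across layers, fit in a short sketch. The tickers and edges below are invented, and the paper scores candidate links probabilistically rather than with the hard rule used here.

      import itertools

      # Toy multiplex snapshot at time t: financial correlation edges and
      # social co-movement edges over the same (hypothetical) assets.
      financial_t = {("AAPL", "MSFT"), ("JPM", "GS")}
      social_t    = {("MSFT", "GOOG"), ("JPM", "GS")}

      def closure_candidates(edges):
          """Pairs that share a neighbour but are not yet linked."""
          nbrs = {}
          for u, v in edges:
              nbrs.setdefault(u, set()).add(v)
              nbrs.setdefault(v, set()).add(u)
          cands = set()
          for peers in nbrs.values():
              for u, v in itertools.combinations(sorted(peers), 2):
                  if (u, v) not in edges and (v, u) not in edges:
                      cands.add((u, v))
          return cands

      # Predicted structure at t+1: persistent financial links plus triadic
      # closures over BOTH layers -- the cross-layer step is what the
      # multiplex approach adds over a financial-only benchmark.
      predicted = financial_t | closure_candidates(financial_t | social_t)
      print(sorted(predicted))   # AAPL-GOOG appears via the social layer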
  11. By: Letizia Mencarini; Delia Irazú Hernández-Farías; Mirko Lai; Viviana Patti; Emilio Sulis; Daniele Vignoli
    Abstract: This article explores opinions and semantic orientation around fertility and parenthood by scrutinizing filtered Italian Twitter data. We propose a novel methodological framework relying on Natural Language Processing techniques for text analysis, aimed at extracting sentiments from texts. A manual annotation scheme for exploring sentiment and attitudes to fertility and parenthood was applied to the Twitter data. The resulting set of tweets (corpus) was analysed through sentiment and emotion lexicons in order to highlight how affective language is used in this domain. It emerges that parents express a generally positive attitude towards their children and towards being and becoming parents, but quite negative sentiments about children's futures, politics, fertility and parental behaviour. Exploiting geographical information from the tweets, we find a significant correlation between the prevalence of positive sentiments about parenthood and macro-regional indicators of both life satisfaction and fertility levels.
    Keywords: sentiment analysis, social media, fertility, parenthood, subjective well-being, linguistic corpora
    Date: 2018–04
    URL: http://d.repec.org/n?u=RePEc:don:donwpa:117&r=big
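    A minimal example of lexicon-based scoring of the kind applied to the corpus. The miniature Italian lexicon and tweets are invented stand-ins for the established sentiment and emotion lexicons the authors actually use.

      # Tiny invented Italian lexicon; the paper uses full sentiment and
      # emotion lexicons over a manually annotated corpus.
      lexicon = {"felice": +1, "orgoglioso": +1, "amore": +1,
                 "paura": -1, "stanco": -1, "costoso": -1}

      tweets = [
          "sono felice e orgoglioso dei miei figli",  # happy and proud of my kids
          "paura per il futuro dei bambini",          # fear for the children's future
      ]

      for t in tweets:
          score = sum(lexicon.get(tok, 0) for tok in t.lower().split())
          label = ("positive" if score > 0
                   else "negative" if score < 0 else "neutral")
          print(f"{label:8s} ({score:+d})  {t}")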
  12. By: Kai Gehring; Valentin F. Lang
    Abstract: IMF programs are often considered to carry a “stigma” that triggers adverse market reactions. We show that such a negative IMF effect disappears when accounting for endogenous selection into programs. To proxy for a country’s access to financial markets, we use credit ratings and investor assessments for 100 countries from 1987 to 2013. Our first identification strategy exploits the differential effect of changes in IMF liquidity on loan allocation. We find that the IMF can “cushion” against falling creditworthiness, despite contractionary adjustments resulting from its programs. A second, event-based strategy using country-times-year fixed effects supports this positive signaling effect. A supplementary text analysis of rating statements validates that agencies perceive IMF programs as positive, particularly when they are associated with reform commitments.
    Keywords: International Monetary Fund, sovereign credit ratings, capital market access, creditworthiness, financial crises
    JEL: E44 F33 F34 G24
    Date: 2018
    URL: http://d.repec.org/n?u=RePEc:ces:ceswps:_7339&r=big
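    A bare-bones two-way fixed effects regression of ratings on program participation, the basic building block behind the paper's (much richer) identification strategies. The panel is simulated, with a positive "cushion" effect built in by construction.

      import numpy as np
      import pandas as pd
      import statsmodels.formula.api as smf

      rng = np.random.default_rng(0)

      # Simulated country-year panel of credit ratings and an IMF-program dummy.
      countries, years = [f"C{i}" for i in range(20)], range(1987, 2014)
      df = pd.DataFrame([(c, y) for c in countries for y in years],
                        columns=["country", "year"])
      df["imf_program"] = rng.binomial(1, 0.2, len(df))
      df["rating"] = (10 + 0.5 * df["imf_program"]      # built-in cushion effect
                      + rng.normal(scale=2, size=len(df)))

      # Two-way fixed effects, standard errors clustered by country.
      fit = smf.ols("rating ~ imf_program + C(country) + C(year)", df).fit(
          cov_type="cluster", cov_kwds={"groups": df["country"]})
      print(fit.params["imf_program"], fit.bse["imf_program"])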
  13. By: John Cuffe; Nathan Goldschlag
    Abstract: Integrating data from different sources has become a fundamental component of modern data analytics. Record linkage methods represent an important class of tools for accomplishing such integration. In the absence of common disambiguated identifiers, researchers often must resort to "fuzzy" matching, which allows imprecision in the characteristics used to identify common entities across different datasets. While the record linkage literature has identified numerous individually useful fuzzy matching techniques, there is little consensus on a way to integrate those techniques within a single framework. To this end, we introduce the Multiple Algorithm Matching for Better Analytics (MAMBA), an easy-to-use, flexible, scalable, and transparent software platform for business record linkage applications using Census microdata. MAMBA leverages multiple string comparators to assess the similarity of records, using a machine learning algorithm to disambiguate matches. This software represents a transparent tool for researchers seeking to link external business data to the Census Business Register files.
    Date: 2018–11
    URL: http://d.repec.org/n?u=RePEc:cen:wpaper:18-46&r=big
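    The core idea, combining several string comparators into one match decision, can be sketched with the standard library alone. MAMBA learns the combination with a machine learning model; the fixed weights, threshold, and business names below are hypothetical.

      from difflib import SequenceMatcher

      def seq_sim(a, b):
          return SequenceMatcher(None, a, b).ratio()

      def token_jaccard(a, b):
          ta, tb = set(a.split()), set(b.split())
          return len(ta & tb) / len(ta | tb)

      def match_score(a, b):
          a, b = a.lower().strip(), b.lower().strip()
          # A fixed average of two comparators stands in for MAMBA's
          # learned combination of many comparators.
          return 0.5 * seq_sim(a, b) + 0.5 * token_jaccard(a, b)

      external = "ACME Manufacturing Co."
      register = ["Acme Manufacturing Company", "Acme Consulting LLC", "Zenith Mfg"]

      for candidate in register:
          s = match_score(external, candidate)
          print(f"{s:.2f}  {'MATCH' if s > 0.6 else 'no match':8s}  {candidate}")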

This nep-big issue is ©2018 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at http://nep.repec.org. For comments please write to the director of NEP, Marco Novarese at <director@nep.repec.org>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.