nep-big New Economics Papers
on Big Data
Issue of 2018‒05‒21
nine papers chosen by
Tom Coupé
University of Canterbury

  1. Deciphering Professional Forecasters’ Stories - Analyzing a Corpus of Textual Predictions for the German Economy By Ulrich Fritsche; Johannes Puckelwald
  2. Honesty in the Digital Age By Alain Cohn; Tobias Gesche; Michel André Maréchal
  3. Ensemble Learning for Cross-Selling Using Multitype Multiway Data By Zhongxia (Shelly) Ye
  4. Sentiment-Based Prediction of Alternative Cryptocurrency Price Fluctuations Using Gradient Boosting Tree Model By Tianyu Ray Li; Anup S. Chamrajnagar; Xander R. Fong; Nicholas R. Rizik; Feng Fu
  5. Piecewise Solutions to Big Data By Henry Chacon; Anuradha Roy
  6. Big Data in Finance and the Growth of Large Firms By Juliane Begenau; Maryam Farboodi; Laura Veldkamp
  7. Record linkage in the Cape of Good Hope Panel By Auke Rijpma; Jeanne Cilliers; Johan Fourie
  8. Learning from man or machine: Spatial aggregation and house price prediction By Sommervoll, Dag Einar; Sommervoll, Åvald
  9. Finding Needles in Haystacks: Artificial Intelligence and Recombinant Growth By Ajay Agrawal; John McHale; Alex Oettl

  1. By: Ulrich Fritsche (Universität Hamburg (University of Hamburg)); Johannes Puckelwald (Universität Hamburg (University of Hamburg))
    Abstract: We analyze a corpus of 564 business cycle forecast reports for the German economy. The dataset covers nine institutions and 27 years. From each report we select the parts that refer exclusively to the forecast for the German economy. Sentiment and frequency analysis confirm that the tone of the textual expressions varies with the business cycle, in line with the hypothesis of adaptive expectations. An 'uncertainty index' calculated from the occurrence of modal words tracks the economic policy uncertainty index of Baker et al. (2016). The latent Dirichlet allocation (LDA) model and the structural topic model (STM) indicate that topics are significantly state- and time-dependent and differ across institutions. Positive or negative forecast 'surprises' experienced in the previous year affect the content of topics.
    Keywords: Sentiment analysis, text analysis, uncertainty, business cycle forecast, forecast error, expectation, adaptive expectation, latent Dirichlet allocation, structural topic model
    JEL: E32 E37 C49
    Date: 2018–05
    URL: http://d.repec.org/n?u=RePEc:hep:macppr:201804&r=big
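    As a rough illustration of the text-analysis toolkit the paper draws on, here is a minimal Python sketch of a modal-word 'uncertainty index' and an LDA topic model over a toy corpus; the sample texts, modal-word list, and parameter choices are illustrative assumptions, not taken from the paper:

      # Toy corpus standing in for the 564 forecast reports.
      from sklearn.feature_extraction.text import CountVectorizer
      from sklearn.decomposition import LatentDirichletAllocation

      reports = [
          "output growth is expected to accelerate next year",
          "uncertainty about exports might weigh on investment",
          "inflation could remain subdued despite rising wages",
      ]

      # Simple uncertainty index: share of modal words per report.
      modal = {"might", "could", "may", "should"}
      uncertainty = [sum(w in modal for w in r.split()) / len(r.split())
                     for r in reports]
      print("uncertainty index:", uncertainty)

      # LDA topic model on the document-term matrix.
      vectorizer = CountVectorizer(stop_words="english")
      dtm = vectorizer.fit_transform(reports)
      lda = LatentDirichletAllocation(n_components=2, random_state=0)
      doc_topics = lda.fit_transform(dtm)   # per-document topic shares

      terms = vectorizer.get_feature_names_out()
      for k, comp in enumerate(lda.components_):
          top = [terms[i] for i in comp.argsort()[-3:][::-1]]
          print(f"topic {k}:", top)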
  2. By: Alain Cohn; Tobias Gesche; Michel André Maréchal
    Abstract: Modern communication technologies enable efficient exchange of information, but often sacrifice direct human interaction inherent in more traditional forms of communication. This raises the question of whether the lack of personal interaction induces individuals to exploit informational asymmetries. We conducted two experiments with 866 subjects to examine how human versus machine interaction influences cheating for financial gain. We find that individuals cheat significantly more when they interact with a machine rather than a person, regardless of whether the machine is equipped with human features. When interacting with a human, individuals are particularly reluctant to report unlikely favorable outcomes, which is consistent with social image concerns. The second experiment shows that dishonest individuals prefer to interact with a machine when facing an opportunity to cheat. Our results suggest that human interaction is key to mitigating dishonest behavior and that self-selection into communication channels can be used to screen for dishonest people.
    Keywords: cheating, honesty, private information, communication, digitization, lying costs
    JEL: C99 D82 D83
    Date: 2018
    URL: http://d.repec.org/n?u=RePEc:ces:ceswps:_6996&r=big
  3. By: Zhongxia (Shelly) Ye (UTSA)
    Abstract: Cross-selling is an integral component of customer relationship management, and using relevant information to improve customer response rates is one of its central challenges. Incorporating multitype multiway customer behavioral data, including related-product, similar-customer, and historical-promotion data, into cross-selling models helps improve classification performance. Such data can be represented by multiple high-order tensors, yet most existing supervised tensor learning methods cannot directly handle the heterogeneous and sparse multiway data that arise in cross-selling. In this study, two novel ensemble learning methods, the multiple kernel support tensor machine (MK-STM) and the multiple support vector machine ensemble (M-SVM-E), are proposed for cross-selling using multitype multiway data; both can also perform feature selection on large sparse multitype multiway data. Based on these two methods, collaborative and non-collaborative ensemble learning frameworks are developed in which many existing classification and ensemble methods can be combined. Computational experiments conducted on two databases extracted from open-access sources show that the MK-STM performs best, outperforming existing supervised tensor learning methods.
    Keywords: Audit committees, voluntary disclosures, director elections, auditor ratification
    JEL: M42
    Date: 2018–01–18
    URL: http://d.repec.org/n?u=RePEc:tsa:wpaper:0157acc&r=big
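    The MK-STM and M-SVM-E are defined in the paper itself; as a simplified stand-in for the ensemble idea, the Python sketch below trains one SVM per behavioral 'view' (product, customer, promotion) and averages their decision scores. The synthetic data and the averaging rule are illustrative assumptions:

      import numpy as np
      from sklearn.svm import SVC

      rng = np.random.default_rng(0)
      n = 200
      y = rng.integers(0, 2, n)  # cross-sell response (0/1)
      # Three heterogeneous feature views of the same customers.
      views = {
          "product":   rng.normal(y[:, None], 1.0, (n, 5)),
          "customer":  rng.normal(y[:, None], 2.0, (n, 8)),
          "promotion": rng.normal(y[:, None], 3.0, (n, 3)),
      }

      # One SVM per view; the ensemble averages decision scores.
      models = {name: SVC(kernel="rbf").fit(X, y) for name, X in views.items()}
      score = np.mean([m.decision_function(views[name])
                       for name, m in models.items()], axis=0)
      pred = (score > 0).astype(int)
      print("training accuracy:", (pred == y).mean())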
  4. By: Tianyu Ray Li; Anup S. Chamrajnagar; Xander R. Fong; Nicholas R. Rizik; Feng Fu
    Abstract: In this paper, we analyze Twitter signals as a medium for user sentiment to predict the price fluctuations of a small-cap alternative cryptocurrency called ZClassic. We extracted tweets on an hourly basis for a period of 3.5 weeks, classifying each tweet as positive, neutral, or negative. We then compiled these tweets into an hourly sentiment index, creating an unweighted and a weighted index, with the latter giving larger weight to retweets. These two indices, alongside the raw sums of positive, negative, and neutral sentiment, were juxtaposed with roughly 400 data points of hourly pricing data to train an Extreme Gradient Boosting regression tree model. Price predictions produced from this model were compared to historical price data, with the resulting predictions having a 0.81 correlation with the testing data. The model's predictions were also statistically significant.
    Date: 2018–05
    URL: http://d.repec.org/n?u=RePEc:arx:papers:1805.00558&r=big
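    A hedged Python sketch of this type of pipeline, with synthetic series standing in for the Twitter and pricing feeds; the feature construction and hyperparameters are assumptions, not the paper's exact setup:

      import numpy as np
      from xgboost import XGBRegressor

      rng = np.random.default_rng(1)
      hours = 400  # roughly the paper's sample size
      pos, neg, neu = rng.poisson([5, 3, 8], (hours, 3)).T
      retweets = rng.poisson(2, hours)

      # Unweighted and retweet-weighted hourly sentiment indices.
      unweighted = (pos - neg) / (pos + neg + neu + 1)  # +1 avoids zero division
      weighted = unweighted * (1 + np.log1p(retweets))
      X = np.column_stack([unweighted, weighted, pos, neg, neu])

      # Synthetic price loosely driven by cumulative sentiment.
      price = 1 + 0.05 * np.cumsum(unweighted) + rng.normal(0, 0.1, hours)
      X_tr, X_te, y_tr, y_te = X[:300], X[300:], price[:300], price[300:]

      model = XGBRegressor(n_estimators=200, max_depth=3).fit(X_tr, y_tr)
      corr = np.corrcoef(model.predict(X_te), y_te)[0, 1]
      print("correlation with held-out prices:", round(corr, 2))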
  5. By: Henry Chacon; Anuradha Roy (UTSA)
    Abstract: Outliers in financial market data often carry important information that requires attention and investigation. Many outlier detection techniques, both parametric and nonparametric, have been developed over the years, each tailored to specific application domains. Nonetheless, outlier detection is not an easy task: sometimes outliers are evident at a glance, while at other times identifying them is extremely cumbersome. Financial series are highly sensitive to world market conditions, owing to the interactions of a very large number of participants, and are also influenced by stock markets operating in other parts of the world, which together produce a non-synchronous process. In this research, we detect the presence of outliers in financial time series for the S&P 500 during 2016, identifying the onset of shocks such as the Brexit referendum and the United States presidential election held that year; generally, the impacts of these events were not drastic. A histogram time series was constructed from the S&P 500 index's closing prices sampled at five-minute intervals during 2015 and 2016. The linear dependency between days of atypical returns was analyzed on the [0–40]% and [60–100]% quantiles, while the Wasserstein distance and an approximation of entropy were used to quantify the presence of instant shocks in the index.
    Keywords: Big data, Outlier detection, Financial market, Histogram time series, Entropy
    JEL: M10 P20 A19
    Date: 2017–12–10
    URL: http://d.repec.org/n?u=RePEc:tsa:wpaper:o156mss&r=big
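    As an illustration of the histogram-time-series idea, this Python sketch compares each day's distribution of five-minute returns with the previous day's via the Wasserstein distance and flags unusually distant days; the synthetic data and the three-sigma threshold are assumptions, not the paper's procedure:

      import numpy as np
      from scipy.stats import wasserstein_distance

      rng = np.random.default_rng(2)
      days, ticks = 60, 78  # 78 five-minute bars per trading session
      returns = rng.normal(0, 1e-3, (days, ticks))
      returns[40] += rng.normal(0, 8e-3, ticks)  # one synthetic shock day

      # Distance between consecutive daily return distributions.
      dist = np.array([wasserstein_distance(returns[d - 1], returns[d])
                       for d in range(1, days)])
      threshold = dist.mean() + 3 * dist.std()
      print("flagged days:", np.where(dist > threshold)[0] + 1)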
  6. By: Juliane Begenau; Maryam Farboodi; Laura Veldkamp
    Abstract: One of the most important trends in modern macroeconomics is the shift from small firms to large firms. At the same time, financial markets have been transformed by advances in information technology. We explore the hypothesis that the use of big data in financial markets has lowered the cost of capital for large firms relative to small ones, enabling large firms to grow larger. Large firms, with more economic activity and a longer firm history, offer more data to process. As faster processors crunch ever more data – macro announcements, earnings statements, competitors' performance metrics, export demand, etc. – large firms become more valuable targets for this data analysis. Once processed, that data can better forecast firm value, reduce the risk of equity investment, and thus reduce the firm's cost of capital. As big data technology improves, large firms attract a more than proportional share of the data processing, allowing them to invest cheaply and grow larger still.
    JEL: E2 G1 D8
    Date: 2018–04
    URL: http://d.repec.org/n?u=RePEc:nbr:nberwo:24550&r=big
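    A stylized one-asset illustration of the mechanism (a standard CARA-normal sketch, not the paper's model): if investors observe n conditionally independent signals with noise variance \sigma_\varepsilon^2 about a firm's payoff, Bayesian updating gives the posterior variance and required excess return

      \hat{\sigma}^2 = \left( \frac{1}{\sigma_0^2} + \frac{n}{\sigma_\varepsilon^2} \right)^{-1}, \qquad \mathbb{E}[r] - r_f \propto \rho \, \hat{\sigma}^2,

    so more processed data (larger n) shrinks posterior risk and hence the cost of capital; if big data disproportionately covers large firms, their cost of capital falls relative to that of small firms.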
  7. By: Auke Rijpma (Department of History, Utrecht University); Jeanne Cilliers (Department of Economic History, Lund University); Johan Fourie (Department of Economics, Stellenbosch University)
    Abstract: In this paper we describe the record linkage procedure to create a panel from Cape Colony census returns, or opgaafrolle, for 1787–1828, a dataset of 42,354 household-level observations. Based on a subset of manually linked records, we first evaluate statistical models and deterministic algorithms to best identify and match households over time. By using household-level characteristics in the linking process and near-annual data, we are able to create high-quality links for 84 percent of the dataset. We compare basic analyses on the linked panel dataset to the original cross-sectional data, evaluate the feasibility of the strategy when linking to supplementary sources, and discuss the scalability of our approach to the full Cape panel.
    Keywords: census, machine learning, micro-data, record linkage, panel data, South Africa
    JEL: N01 C81
    Date: 2018
    URL: http://d.repec.org/n?u=RePEc:sza:wpaper:wpapers299&r=big
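    A minimal Python sketch of the linkage idea, matching household heads across two census years on name similarity plus a household characteristic; the scoring weights and acceptance cutoff are illustrative assumptions, not the paper's estimated model:

      from difflib import SequenceMatcher

      year1 = [{"id": 1, "name": "jan van der merwe", "children": 3},
               {"id": 2, "name": "pieter botha", "children": 1}]
      year2 = [{"id": "a", "name": "jan v d merwe", "children": 3},
               {"id": "b", "name": "petrus botha", "children": 2}]

      def score(a, b):
          # Blend fuzzy name similarity with household agreement.
          name_sim = SequenceMatcher(None, a["name"], b["name"]).ratio()
          hh_sim = 1.0 if a["children"] == b["children"] else 0.5
          return 0.8 * name_sim + 0.2 * hh_sim

      for a in year1:
          best = max(year2, key=lambda b: score(a, b))
          if score(a, best) > 0.7:  # accept only confident links
              print(a["id"], "->", best["id"], round(score(a, best), 2))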
  8. By: Sommervoll, Dag Einar (Centre for Land Tenure Studies, Norwegian University of Life Sciences); Sommervoll, Åvald (Department of Informatics)
    Abstract: House prices vary with location, yet the border between two neighboring housing markets tends to be fuzzy. When we seek to explain or predict house prices, we need to correct for this spatial price variation. A widely used approach is to include neighborhood dummy variables, but in general it is not clear how to choose a spatial subdivision from the vast space of all possible spatial aggregations. We take a biologically inspired approach in which different spatial aggregations mutate and recombine according to their explanatory power in a standard hedonic housing market model. We find that the genetic algorithm consistently finds aggregations that outperform conventional aggregations both in and out of sample. A comparison of the best aggregations from different runs of the genetic algorithm shows that even though they converge to similarly high explanatory power, they tend to be genetically and economically different, with the differences largely confined to areas with few housing market transactions.
    Keywords: House price prediction; Machine learning; Genetic algorithm; Spatial aggregation
    JEL: C45 R21 R31
    Date: 2018–04–24
    URL: http://d.repec.org/n?u=RePEc:hhs:nlsclt:2018_004&r=big
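    A toy Python version of the approach: a chromosome assigns each small area to one of k neighborhood groups, fitness is the R^2 of a hedonic regression with group dummies, and the population evolves by crossover and mutation. Data sizes, rates, and the hedonic specification are illustrative assumptions:

      import numpy as np
      from sklearn.linear_model import LinearRegression

      rng = np.random.default_rng(3)
      n, areas, k = 500, 20, 4
      area = rng.integers(0, areas, n)
      size = rng.uniform(40, 200, n)  # dwelling size
      price = 2000 * size + 1e5 * (area % 4) + rng.normal(0, 5e4, n)

      def fitness(chrom):  # chrom maps each area to a group
          groups = chrom[area]
          X = np.column_stack([size, np.eye(k)[groups]])  # size + dummies
          return LinearRegression().fit(X, price).score(X, price)

      pop = [rng.integers(0, k, areas) for _ in range(20)]
      for _ in range(30):
          pop.sort(key=fitness, reverse=True)
          parents = pop[:10]
          children = []
          for _ in range(10):
              a, b = rng.choice(10, 2, replace=False)
              cut = rng.integers(1, areas)  # one-point crossover
              child = np.concatenate([parents[a][:cut], parents[b][cut:]])
              if rng.random() < 0.3:  # point mutation
                  child[rng.integers(areas)] = rng.integers(k)
              children.append(child)
          pop = parents + children
      print("best in-sample R^2:", round(max(map(fitness, pop)), 3))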
  9. By: Ajay Agrawal; John McHale; Alex Oettl
    Abstract: Innovation is often predicated on discovering useful new combinations of existing knowledge in highly complex knowledge spaces. These needle-in-a-haystack problems are pervasive in fields like genomics, drug discovery, materials science, and particle physics. We develop a combinatorial knowledge production function and embed it in the classic Jones (1995) growth model to explore how breakthroughs in artificial intelligence (AI) that dramatically improve prediction accuracy about which combinations have the highest potential could enhance discovery rates and, consequently, economic growth. This production function is a generalization (and reinterpretation) of the Romer/Jones knowledge production function. Separate parameters control the extent of individual-researcher knowledge access, the effects of fishing out/complexity, and the ease of forming research teams.
    JEL: O3 O33 O4
    Date: 2018–04
    URL: http://d.repec.org/n?u=RePEc:nbr:nberwo:24541&r=big
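    For reference, the classic Jones (1995) knowledge production function that the paper generalizes can be written as

      \dot{A}_t = \delta L_{A,t}^{\lambda} A_t^{\phi}, \qquad 0 < \lambda \le 1, \ \phi < 1,

    where A_t is the stock of knowledge, L_{A,t} the number of researchers, \lambda captures duplication of research effort, and \phi allows returns to the existing knowledge stock to diminish (fishing out). The paper's combinatorial reinterpretation adds separate parameters for individual researchers' knowledge access, fishing-out/complexity, and the ease of forming research teams.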

This nep-big issue is ©2018 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at http://nep.repec.org. For comments please write to the director of NEP, Marco Novarese at <director@nep.repec.org>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.