nep-big New Economics Papers
on Big Data
Issue of 2018‒05‒21
nine papers chosen by
Tom Coupé
University of Canterbury

  1. Deciphering Professional Forecasters’ Stories - Analyzing a Corpus of Textual Predictions for the German Economy By Ulrich Fritsche; Johannes Puckelwald
  2. Honesty in the Digital Age By Alain Cohn; Tobias Gesche; Michel André Maréchal
  3. Ensemble Learning for Cross-Selling Using Multitype Multiway Data By Zhongxia (Shelly) Ye
  4. Sentiment-Based Prediction of Alternative Cryptocurrency Price Fluctuations Using Gradient Boosting Tree Model By Tianyu Ray Li; Anup S. Chamrajnagar; Xander R. Fong; Nicholas R. Rizik; Feng Fu
  5. Piecewise Solutions to Big Data By Henry Chacon; Anuradha Roy
  6. Big Data in Finance and the Growth of Large Firms By Juliane Begenau; Maryam Farboodi; Laura Veldkamp
  7. Record linkage in the Cape of Good Hope Panel By Auke Rijpma; Jeanne Cilliers; Johan Fourie
  8. Learning from man or machine: Spatial aggregation and house price prediction By Sommervoll, Dag Einar; Sommervoll, Åvald
  9. Finding Needles in Haystacks: Artificial Intelligence and Recombinant Growth By Ajay Agrawal; John McHale; Alex Oettl

  1. By: Ulrich Fritsche (Universität Hamburg (University of Hamburg)); Johannes Puckelwald (Universität Hamburg (University of Hamburg))
    Abstract: We analyze a corpus of 564 business cycle forecast reports for the German economy. The dataset covers nine institutions and 27 years. From each report we select the parts that refer exclusively to the forecast for the German economy. Sentiment and frequency analysis confirm that the tone of the textual expressions varies with the business cycle, in line with the hypothesis of adaptive expectations. An 'uncertainty index' calculated from the occurrence of modal words tracks the economic policy uncertainty index of Baker et al. (2016). The latent Dirichlet allocation (LDA) model and the structural topic model (STM) indicate that topics are significantly state- and time-dependent and differ across institutions. Positive or negative forecast 'surprises' experienced in the previous year affect the content of topics.
    Keywords: Sentiment analysis, text analysis, uncertainty, business cycle forecast, forecast error, expectation, adaptive expectation, latent Dirichlet allocation, structural topic model
    JEL: E32 E37 C49
    Date: 2018–05
    URL: http://d.repec.org/n?u=RePEc:hep:macppr:201804&r=big
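    As a rough illustration of the text-analysis toolkit the paper draws on, here is a minimal Python sketch of a modal-word 'uncertainty index' and an LDA topic model over a toy corpus; the sample texts, modal-word list, and parameter choices are illustrative assumptions, not taken from the paper:

      # Toy corpus standing in for the 564 forecast reports.
      from sklearn.feature_extraction.text import CountVectorizer
      from sklearn.decomposition import LatentDirichletAllocation

      reports = [
          "output growth is expected to accelerate next year",
          "uncertainty about exports might weigh on investment",
          "inflation could remain subdued despite rising wages",
      ]

      # Simple uncertainty index: share of modal words per report.
      modal = {"might", "could", "may", "should"}
      uncertainty = [sum(w in modal for w in r.split()) / len(r.split())
                     for r in reports]
      print("uncertainty index:", uncertainty)

      # LDA topic model on the document-term matrix.
      vectorizer = CountVectorizer(stop_words="english")
      dtm = vectorizer.fit_transform(reports)
      lda = LatentDirichletAllocation(n_components=2, random_state=0)
      doc_topics = lda.fit_transform(dtm)   # per-document topic shares

      terms = vectorizer.get_feature_names_out()
      for k, comp in enumerate(lda.components_):
          top = [terms[i] for i in comp.argsort()[-3:][::-1]]
          print(f"topic {k}:", top)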
  2. By: Alain Cohn; Tobias Gesche; Michel André Maréchal
    Abstract: Modern communication technologies enable efficient exchange of information, but often sacrifice direct human interaction inherent in more traditional forms of communication. This raises the question of whether the lack of personal interaction induces individuals to exploit informational asymmetries. We conducted two experiments with 866 subjects to examine how human versus machine interaction influences cheating for financial gain. We find that individuals cheat significantly more when they interact with a machine rather than a person, regardless of whether the machine is equipped with human features. When interacting with a human, individuals are particularly reluctant to report unlikely favorable outcomes, which is consistent with social image concerns. The second experiment shows that dishonest individuals prefer to interact with a machine when facing an opportunity to cheat. Our results suggest that human interaction is key to mitigating dishonest behavior and that self-selection into communication channels can be used to screen for dishonest people.
    Keywords: cheating, honesty, private information, communication, digitization, lying costs
    JEL: C99 D82 D83
    Date: 2018
    URL: http://d.repec.org/n?u=RePEc:ces:ceswps:_6996&r=big
  3. By: Zhongxia (Shelly) Ye (UTSA)
    Abstract: Cross-selling is an integral component of customer relationship management, and using relevant information to improve customer response rates is one of its central challenges. Incorporating multitype multiway customer behavioral data, including related-product, similar-customer, and historical-promotion data, into cross-selling models helps improve classification performance. Such data can be represented by multiple high-order tensors, yet most existing supervised tensor learning methods cannot directly handle the heterogeneous and sparse multiway data that arise in cross-selling. In this study, two novel ensemble learning methods, the multiple kernel support tensor machine (MK-STM) and the multiple support vector machine ensemble (M-SVM-E), are proposed for cross-selling using multitype multiway data; both can also perform feature selection on large sparse multitype multiway data. Based on these two methods, collaborative and non-collaborative ensemble learning frameworks are developed in which many existing classification and ensemble methods can be combined. Computational experiments conducted on two databases extracted from open-access sources show that the MK-STM performs best, outperforming existing supervised tensor learning methods.
    Keywords: Audit committees, voluntary disclosures, director elections, auditor ratification
    JEL: M42
    Date: 2018–01–18
    URL: http://d.repec.org/n?u=RePEc:tsa:wpaper:0157acc&r=big
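    The MK-STM and M-SVM-E are defined in the paper itself; as a simplified stand-in for the ensemble idea, the Python sketch below trains one SVM per behavioral 'view' (product, customer, promotion) and averages their decision scores. The synthetic data and the averaging rule are illustrative assumptions:

      import numpy as np
      from sklearn.svm import SVC

      rng = np.random.default_rng(0)
      n = 200
      y = rng.integers(0, 2, n)  # cross-sell response (0/1)
      # Three heterogeneous feature views of the same customers.
      views = {
          "product":   rng.normal(y[:, None], 1.0, (n, 5)),
          "customer":  rng.normal(y[:, None], 2.0, (n, 8)),
          "promotion": rng.normal(y[:, None], 3.0, (n, 3)),
      }

      # One SVM per view; the ensemble averages decision scores.
      models = {name: SVC(kernel="rbf").fit(X, y) for name, X in views.items()}
      score = np.mean([m.decision_function(views[name])
                       for name, m in models.items()], axis=0)
      pred = (score > 0).astype(int)
      print("training accuracy:", (pred == y).mean())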
  4. By: Tianyu Ray Li; Anup S. Chamrajnagar; Xander R. Fong; Nicholas R. Rizik; Feng Fu
    Abstract: In this paper, we analyze Twitter signals as a medium for user sentiment to predict the price fluctuations of a small-cap alternative cryptocurrency called ZClassic. We extracted tweets on an hourly basis for a period of 3.5 weeks, classifying each tweet as positive, neutral, or negative. We then compiled these tweets into an hourly sentiment index, creating an unweighted and a weighted index, with the latter giving larger weight to retweets. These two indices, alongside the raw sums of positive, negative, and neutral sentiment, were juxtaposed with roughly 400 data points of hourly pricing data to train an Extreme Gradient Boosting regression tree model. Price predictions produced from this model were compared to historical price data, with the resulting predictions having a 0.81 correlation with the testing data. The model's predictions were also statistically significant.
    Date: 2018–05
    URL: http://d.repec.org/n?u=RePEc:arx:papers:1805.00558&r=big
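    A hedged Python sketch of this type of pipeline, with synthetic series standing in for the Twitter and pricing feeds; the feature construction and hyperparameters are assumptions, not the paper's exact setup:

      import numpy as np
      from xgboost import XGBRegressor

      rng = np.random.default_rng(1)
      hours = 400  # roughly the paper's sample size
      pos, neg, neu = rng.poisson([5, 3, 8], (hours, 3)).T
      retweets = rng.poisson(2, hours)

      # Unweighted and retweet-weighted hourly sentiment indices.
      unweighted = (pos - neg) / (pos + neg + neu + 1)  # +1 avoids zero division
      weighted = unweighted * (1 + np.log1p(retweets))
      X = np.column_stack([unweighted, weighted, pos, neg, neu])

      # Synthetic price loosely driven by cumulative sentiment.
      price = 1 + 0.05 * np.cumsum(unweighted) + rng.normal(0, 0.1, hours)
      X_tr, X_te, y_tr, y_te = X[:300], X[300:], price[:300], price[300:]

      model = XGBRegressor(n_estimators=200, max_depth=3).fit(X_tr, y_tr)
      corr = np.corrcoef(model.predict(X_te), y_te)[0, 1]
      print("correlation with held-out prices:", round(corr, 2))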
  5. By: Henry Chacon; Anuradha Roy (UTSA)
    Abstract: Outliers in financial market data often carry important information that requires attention and investigation. Many outlier detection techniques, both parametric and nonparametric, have been developed over the years, each tailored to specific application domains. Nonetheless, outlier detection is not an easy task: sometimes outliers are evident at a glance, while at other times identifying them is extremely cumbersome. Financial series are highly sensitive to world market conditions, owing to the interactions of a very large number of participants, and are also influenced by stock markets operating in other parts of the world, which together produce a non-synchronous process. In this research, we detect the presence of outliers in financial time series for the S&P 500 during 2016, identifying the onset of shocks such as the Brexit referendum and the United States presidential election held that year; generally, the impacts of these events were not drastic. A histogram time series was constructed from the S&P 500 index's closing prices sampled at five-minute intervals during 2015 and 2016. The linear dependency between days of atypical returns was analyzed on the [0–40]% and [60–100]% quantiles, while the Wasserstein distance and an approximation of entropy were used to quantify the presence of instant shocks in the index.
    Keywords: Big data, Outlier detection, Financial market, Histogram time series, Entropy
    JEL: M10 P20 A19
    Date: 2017–12–10
    URL: http://d.repec.org/n?u=RePEc:tsa:wpaper:o156mss&r=big
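    As an illustration of the histogram-time-series idea, this Python sketch compares each day's distribution of five-minute returns with the previous day's via the Wasserstein distance and flags unusually distant days; the synthetic data and the three-sigma threshold are assumptions, not the paper's procedure:

      import numpy as np
      from scipy.stats import wasserstein_distance

      rng = np.random.default_rng(2)
      days, ticks = 60, 78  # 78 five-minute bars per trading session
      returns = rng.normal(0, 1e-3, (days, ticks))
      returns[40] += rng.normal(0, 8e-3, ticks)  # one synthetic shock day

      # Distance between consecutive daily return distributions.
      dist = np.array([wasserstein_distance(returns[d - 1], returns[d])
                       for d in range(1, days)])
      threshold = dist.mean() + 3 * dist.std()
      print("flagged days:", np.where(dist > threshold)[0] + 1)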
  6. By: Juliane Begenau; Maryam Farboodi; Laura Veldkamp
    Abstract: One of the most important trends in modern macroeconomics is the shift from small firms to large firms. At the same time, financial markets have been transformed by advances in information technology. We explore the hypothesis that the use of big data in financial markets has lowered the cost of capital for large firms relative to small ones, enabling large firms to grow larger. Large firms, with more economic activity and a longer firm history, offer more data to process. As faster processors crunch ever more data – macro announcements, earnings statements, competitors' performance metrics, export demand, etc. – large firms become more valuable targets for this data analysis. Once processed, that data can better forecast firm value, reduce the risk of equity investment, and thus reduce the firm's cost of capital. As big data technology improves, large firms attract a more than proportional share of the data processing, allowing them to invest cheaply and grow larger still.
    JEL: E2 G1 D8
    Date: 2018–04
    URL: http://d.repec.org/n?u=RePEc:nbr:nberwo:24550&r=big
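    A stylized one-asset illustration of the mechanism (a standard CARA-normal sketch, not the paper's model): if investors observe n conditionally independent signals with noise variance \sigma_\varepsilon^2 about a firm's payoff, Bayesian updating gives the posterior variance and required excess return

      \hat{\sigma}^2 = \left( \frac{1}{\sigma_0^2} + \frac{n}{\sigma_\varepsilon^2} \right)^{-1}, \qquad \mathbb{E}[r] - r_f \propto \rho \, \hat{\sigma}^2,

    so more processed data (larger n) shrinks posterior risk and hence the cost of capital; if big data disproportionately covers large firms, their cost of capital falls relative to that of small firms.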
  7. By: Auke Rijpma (Department of History, Utrecht University); Jeanne Cilliers (Department of Economic History, Lund University); Johan Fourie (Department of Economics, Stellenbosch University)
    Abstract: In this paper we describe the record linkage procedure to create a panel from Cape Colony census returns, or opgaafrolle, for 1787–1828, a dataset of 42,354 household-level observations. Based on a subset of manually linked records, we first evaluate statistical models and deterministic algorithms to best identify and match households over time. By using household-level characteristics in the linking process and near-annual data, we are able to create high-quality links for 84 percent of the dataset. We compare basic analyses on the linked panel dataset to the original cross-sectional data, evaluate the feasibility of the strategy when linking to supplementary sources, and discuss the scalability of our approach to the full Cape panel.
    Keywords: census, machine learning, micro-data, record linkage, panel data, South Africa
    JEL: N01 C81
    Date: 2018
    URL: http://d.repec.org/n?u=RePEc:sza:wpaper:wpapers299&r=big
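    A minimal Python sketch of the linkage idea, matching household heads across two census years on name similarity plus a household characteristic; the scoring weights and acceptance cutoff are illustrative assumptions, not the paper's estimated model:

      from difflib import SequenceMatcher

      year1 = [{"id": 1, "name": "jan van der merwe", "children": 3},
               {"id": 2, "name": "pieter botha", "children": 1}]
      year2 = [{"id": "a", "name": "jan v d merwe", "children": 3},
               {"id": "b", "name": "petrus botha", "children": 2}]

      def score(a, b):
          # Blend fuzzy name similarity with household agreement.
          name_sim = SequenceMatcher(None, a["name"], b["name"]).ratio()
          hh_sim = 1.0 if a["children"] == b["children"] else 0.5
          return 0.8 * name_sim + 0.2 * hh_sim

      for a in year1:
          best = max(year2, key=lambda b: score(a, b))
          if score(a, best) > 0.7:  # accept only confident links
              print(a["id"], "->", best["id"], round(score(a, best), 2))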
  8. By: Sommervoll, Dag Einar (Centre for Land Tenure Studies, Norwegian University of Life Sciences); Sommervoll, Åvald (Department of Informatics)
    Abstract: House prices vary with location, yet the border between two neighboring housing markets tends to be fuzzy. When we seek to explain or predict house prices, we need to correct for this spatial price variation. A widely used approach is to include neighborhood dummy variables, but in general it is not clear how to choose a spatial subdivision from the vast space of all possible spatial aggregations. We take a biologically inspired approach in which different spatial aggregations mutate and recombine according to their explanatory power in a standard hedonic housing market model. We find that the genetic algorithm consistently finds aggregations that outperform conventional aggregations both in and out of sample. A comparison of the best aggregations from different runs of the genetic algorithm shows that even though they converge to similarly high explanatory power, they tend to be genetically and economically different, with the differences largely confined to areas with few housing market transactions.
    Keywords: House price prediction; Machine learning; Genetic algorithm; Spatial aggregation
    JEL: C45 R21 R31
    Date: 2018–04–24
    URL: http://d.repec.org/n?u=RePEc:hhs:nlsclt:2018_004&r=big
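    A toy Python version of the approach: a chromosome assigns each small area to one of k neighborhood groups, fitness is the R^2 of a hedonic regression with group dummies, and the population evolves by crossover and mutation. Data sizes, rates, and the hedonic specification are illustrative assumptions:

      import numpy as np
      from sklearn.linear_model import LinearRegression

      rng = np.random.default_rng(3)
      n, areas, k = 500, 20, 4
      area = rng.integers(0, areas, n)
      size = rng.uniform(40, 200, n)  # dwelling size
      price = 2000 * size + 1e5 * (area % 4) + rng.normal(0, 5e4, n)

      def fitness(chrom):  # chrom maps each area to a group
          groups = chrom[area]
          X = np.column_stack([size, np.eye(k)[groups]])  # size + dummies
          return LinearRegression().fit(X, price).score(X, price)

      pop = [rng.integers(0, k, areas) for _ in range(20)]
      for _ in range(30):
          pop.sort(key=fitness, reverse=True)
          parents = pop[:10]
          children = []
          for _ in range(10):
              a, b = rng.choice(10, 2, replace=False)
              cut = rng.integers(1, areas)  # one-point crossover
              child = np.concatenate([parents[a][:cut], parents[b][cut:]])
              if rng.random() < 0.3:  # point mutation
                  child[rng.integers(areas)] = rng.integers(k)
              children.append(child)
          pop = parents + children
      print("best in-sample R^2:", round(max(map(fitness, pop)), 3))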
  9. By: Ajay Agrawal; John McHale; Alex Oettl
    Abstract: Innovation is often predicated on discovering useful new combinations of existing knowledge in highly complex knowledge spaces. These needle-in-a-haystack problems are pervasive in fields like genomics, drug discovery, materials science, and particle physics. We develop a combinatorial knowledge production function and embed it in the classic Jones (1995) growth model to explore how breakthroughs in artificial intelligence (AI) that dramatically improve prediction accuracy about which combinations have the highest potential could enhance discovery rates and, consequently, economic growth. This production function is a generalization (and reinterpretation) of the Romer/Jones knowledge production function. Separate parameters control the extent of individual-researcher knowledge access, the effects of fishing out/complexity, and the ease of forming research teams.
    JEL: O3 O33 O4
    Date: 2018–04
    URL: http://d.repec.org/n?u=RePEc:nbr:nberwo:24541&r=big
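    For reference, the classic Jones (1995) knowledge production function that the paper generalizes can be written as

      \dot{A}_t = \delta L_{A,t}^{\lambda} A_t^{\phi}, \qquad 0 < \lambda \le 1, \ \phi < 1,

    where A_t is the stock of knowledge, L_{A,t} the number of researchers, \lambda captures duplication of research effort, and \phi allows returns to the existing knowledge stock to diminish (fishing out). The paper's combinatorial reinterpretation adds separate parameters for individual researchers' knowledge access, fishing-out/complexity, and the ease of forming research teams.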

This nep-big issue is ©2018 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at http://nep.repec.org. For comments please write to the director of NEP, Marco Novarese at <director@nep.repec.org>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.