nep-big New Economics Papers
on Big Data
Issue of 2018‒04‒16
eighteen papers chosen by
Tom Coupé
University of Canterbury

  1. The Effect of Big Data on Recommendation Quality: The Example of Internet Search By Maximilian Schäfer; Geza Sapi; Szabolcs Lorincz
  2. The Roles of Alternative Data and Machine Learning in Fintech Lending: Evidence from the LendingClub Consumer Platform By Jagtiani, Julapa; Lemieux, Catharine
  3. Consumers' Privacy Choices in the Era of Big Data By Prüfer, Jens; Dengler, Sebastian
  4. Shining a Light on Purchasing Power Parities By Maxim Pinkovskiy; Xavier Sala-i-Martin
  5. Monetary Policy Communication of the Bank of Japan: Computational Text Analysis By Yusuke Oshima; Yoichi Matsubayashi
  6. Reducing Estimation Risk in Mean-Variance Portfolios with Machine Learning By Daniel Kinn
  7. Evaluating Hospital Case Cost Prediction Models Using Azure Machine Learning Studio By Alexei Botchkarev
  8. Classifying Occupations According to Their Skill Requirements in Job Advertisements By Jyldyz Djumalieva; Antonio Lima; Cath Sleeman
  9. A Survey of Big Data Technologies and Internet of Things for Economic Growth and Sustainable Development By Paul Adeoye Omosebi; Adetunji Philip Adewole
  10. Inventor Name Disambiguation with Gradient Boosting Decision Tree and Inventor Mobility in China (1985-2016) By YIN Deyun; MOTOHASHI Kazuyuki
  11. Speaking sociologically with big data: symphonic social science and the future for big data research By Halford, Susan; Savage, Mike
  12. The Digital Era, Viewed From a Perspective of Millennia of Economic Growth By Jakub Growiec
  13. Monopsony in Online Labor Markets By Arindrajit Dube; Jeff Jacobs; Suresh Naidu; Siddharth Suri
  14. Ethics, algorithms and self-driving cars – a CSI of the ‘trolley problem’ By Renda, Andrea
  15. Exploring the predictability of range-based volatility estimators using RNNs By Gábor Petneházi; József Gáll
  16. Predictive modeling of stock indices closing from web search trends By Arjun R; Suprabha KR
  17. Business cycle narratives By Vegard Høghaug Larsen; Leif Anders Thorsrud
  18. Economic policy uncertainty and stock market participation By Gábor-Tóth, Enikő; Georgarakos, Dimitris

  1. By: Maximilian Schäfer; Geza Sapi; Szabolcs Lorincz
    Abstract: Are there economies of scale to data in internet search? This paper is the first to use real search engine query logs to empirically investigate how data drives the quality of internet search results. We find evidence that the quality of search results improves with more data on previous searches. Moreover, our results indicate that the type of data matters as well: personalized information is particularly valuable as it massively increases the speed of learning. We also provide some evidence that factors not directly related to data, such as the general quality of the applied algorithms, play an important role. The suggested methods to disentangle the effect of data from other factors driving the quality of search results can be applied to assess the returns to data in various recommendation systems in e-commerce, including product and information search. We also discuss the managerial, privacy, and competition policy implications of our findings.
    Keywords: Big Data, Recommendation quality, Internet search, E-Commerce, Economies of Scale, Search engines
    JEL: L81 L86 M15
    Date: 2018
  2. By: Jagtiani, Julapa (Federal Reserve Bank of Philadelphia); Lemieux, Catharine (Federal Reserve Bank of Chicago)
    Abstract: Supersedes Working Paper 17-17. Fintech has been playing an increasing role in shaping financial and banking landscapes. There have been concerns about the use of alternative data sources by fintech lenders and the impact on financial inclusion. We compare loans made by a large fintech lender and similar loans that were originated through traditional banking channels. Specifically, we use account-level data from LendingClub and Y-14M data reported by bank holding companies with total assets of $50 billion or more. We find a high correlation with interest rate spreads, LendingClub rating grades, and loan performance. Interestingly, the correlations between the rating grades and FICO scores have declined from about 80 percent (for loans that were originated in 2007) to only about 35 percent for recent vintages (originated in 2014–2015), indicating that nontraditional alternative data have been increasingly used by fintech lenders. Furthermore, we find that the rating grades (assigned based on alternative data) perform well in predicting loan performance over the two years after origination. The use of alternative data has allowed some borrowers who would have been classified as subprime by traditional criteria to be slotted into “better” loan grades, which allowed them to get lower priced credit. In addition, for the same risk of default, consumers pay smaller spreads on loans from LendingClub than from credit card borrowing.
    Keywords: Fintech; LendingClub; Marketplace Lending; Alternative Data; Shadow Banking; P2P Lending; Peer-to-peer Lending
    JEL: G18 G21 G28 L21
    Date: 2018–04–05
  3. By: Prüfer, Jens (Tilburg University, Center For Economic Research); Dengler, Sebastian (Tilburg University, Center For Economic Research)
    Abstract: Recent progress in information technologies provides sellers with detailed knowledge about consumers' preferences, approaching perfect price discrimination in the limit. We construct a model where consumers with less strategic sophistication than the seller's pricing algorithm face a trade-off when buying. They choose between a direct, transaction cost-free sales channel and a privacy-protecting, but costly, anonymous channel. We show that the anonymous channel is used even in the absence of an explicit taste for privacy if consumers are not too strategically sophisticated. This provides a micro-foundation for consumers' privacy choices. Some consumers benefit but others suffer from their anonymization.
    Keywords: privacy; big data; perfect price discrimination; level-k thinking
    JEL: L11 D11 D83 D01 L86
    Date: 2018
  4. By: Maxim Pinkovskiy; Xavier Sala-i-Martin
    Abstract: Nighttime lights data are a measure of economic activity whose error is plausibly independent of the measurement errors of most conventional indicators. Therefore, we can use nighttime lights as an independent benchmark to assess existing measures of economic activity (Pinkovskiy and Sala-i-Martin (2016)). We employ this insight to generate three findings in the study of PPP-adjusted estimates of GDP around the world between 1992 and 2010. First, we find that while market exchange rates described poor economies better than did PPP-adjusted estimates in the late 1990s (Dowrick and Akmal 2008; Almas 2012), this pattern has disappeared by the 2010s. Second, we also find that estimates of PPPs have been steadily improving from one price survey round to the next, including during the controversial 2005 and 2011 rounds. Third, we leverage this fact to assess whether it is optimal to measure relative prices as close as possible to the year of interest or to use the latest available relative price data and discard the rest, and provide a theoretical framework in which the latter may be optimal. Using data from the Penn World Tables, we find that, indeed, it is optimal to only use the latest price data, and hence, to revise existing PPP-adjusted estimates whenever a new price survey is released.
    JEL: A1 E01 F00
    Date: 2018–03
  5. By: Yusuke Oshima (Graduate School of Economics, Kobe University); Yoichi Matsubayashi (Graduate School of Economics, Kobe University)
    Abstract: In this study, we empirically examine the effects of the Bank of Japan (BOJ)'s communications through its meeting minutes on the financial markets, especially during Mr. Kuroda's administration from April 2013 to September 2017. Using computational linguistic models and Latent Dirichlet Allocation, we quantify the contents of the BOJ minutes and extract topics from these minutes, including the bank's historical monetary policy and policymakers' views on current economic conditions. The empirical results suggest that a relationship exists between the estimated topics and the market reactions on the days on which the minutes are released. Although the market paid attention to the monetary policy description in the minutes in the early period of the introduction of quantitative and qualitative monetary easing (QQE), the significance of monetary policy information under the October 2014 expansion of the QQE on financial markets faded. In contrast, information on fund-provisioning measures to support Japanese companies' activities, including a negative interest rate policy, induced a decline in the stock market. We found that the market pays attention to meeting members' opinions on current economic conditions.
    Date: 2018–04
  6. By: Daniel Kinn
    Abstract: In portfolio analysis, the traditional approach of replacing population moments with sample counterparts may lead to suboptimal portfolio choices. In this paper I show that selecting asset positions to maximize expected quadratic utility is equivalent to a machine learning (ML) problem, where the asset weights are chosen to minimize out-of-sample mean squared error. It follows that ML specifically targets estimation risk when choosing the asset weights, and that "off-the-shelf" ML algorithms obtain optimal portfolios taking parameter uncertainty into account. Linear regression is a special case of the proposed ML framework, equivalent to the traditional approach. Standard results from the machine learning literature may be used to derive conditions for when ML algorithms improve upon linear regression. Based on simulation studies and several datasets, I find that ML significantly reduces estimation risk compared to the traditional approach and several shrinkage approaches proposed in the literature.
    Date: 2018–04
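The link between regression and the traditional plug-in portfolio described in the abstract above can be illustrated with a small numerical sketch. It is not the author's code. It assumes one standard regression formulation that the abstract does not spell out: regress a constant target of 1 on asset excess returns without an intercept. The return figures and the helper names `solve2` and `normalise` are hypothetical. After rescaling to sum to one, the least-squares weights match the plug-in weights obtained from the sample mean and sample covariance.

```python
# Hypothetical excess returns for two assets over six periods (illustrative only).
returns = [(0.02, 0.01), (-0.01, 0.03), (0.03, -0.02),
           (0.01, 0.02), (-0.02, 0.01), (0.04, 0.00)]
T = len(returns)


def solve2(a, b):
    """Solve the 2x2 linear system a x = b by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(b[0] * a[1][1] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - a[1][0] * b[0]) / det]


def normalise(w):
    """Rescale weights so that they sum to one."""
    total = sum(w)
    return [x / total for x in w]


# Least squares of a constant 1 on returns, no intercept (assumed formulation):
# the normal equations are (R'R) w = R'1.
gram = [[sum(r[i] * r[j] for r in returns) for j in range(2)] for i in range(2)]
rhs = [sum(r[i] for r in returns) for i in range(2)]
w_reg = solve2(gram, rhs)

# Traditional plug-in approach: weights proportional to S^{-1} mu, using the
# sample mean mu and the (1/T) sample covariance S.
mu = [x / T for x in rhs]
cov = [[gram[i][j] / T - mu[i] * mu[j] for j in range(2)] for i in range(2)]
w_plugin = solve2(cov, mu)

w_reg_n = normalise(w_reg)
w_plugin_n = normalise(w_plugin)
print("regression weights:", w_reg_n)
print("plug-in weights:   ", w_plugin_n)
```

The two weight vectors coincide because R'R/T equals S plus the outer product of the sample mean with itself, so the regression solution is a scalar multiple of the plug-in solution. Regularised or nonlinear learners would replace the least-squares step; that part is not shown here.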
  7. By: Alexei Botchkarev
    Abstract: The ability to model and predict hospital case costs accurately is critical for efficient health care financial management and budgetary planning. A variety of regression machine learning algorithms are known to be effective for health care cost predictions. The purpose of this experiment was to build an Azure Machine Learning Studio tool for rapid assessment of multiple types of regression models. The tool offers an environment for comparing 14 types of regression models in a unified experiment: linear regression, Bayesian linear regression, decision forest regression, boosted decision tree regression, neural network regression, Poisson regression, Gaussian processes for regression, gradient boosted machine, nonlinear least squares regression, projection pursuit regression, random forest regression, robust regression, robust regression with MM-type estimators, and support vector regression. The tool presents assessment results arranged by model accuracy in a single table using five performance metrics. Evaluation of regression machine learning models for hospital case cost prediction demonstrated the advantage of the robust regression model, boosted decision tree regression and decision forest regression. The operational tool has been published to the web and is openly available for experiments and extensions.
    Date: 2018–04
  8. By: Jyldyz Djumalieva; Antonio Lima; Cath Sleeman
    Abstract: In this work, we propose a methodology for classifying occupations based on skill requirements provided in online job adverts. To develop the classification methodology, we apply semi-supervised machine learning techniques to a dataset of 37 million UK online job adverts collected by Burning Glass Technologies. The resulting occupational classification comprises four hierarchical layers: the first three layers relate to skill specialisation and group jobs that require similar types of skills. The fourth layer of the hierarchy is based on the offered salary and indicates skill level. The proposed classification will have the potential to enable measurement of an individual's career progression within the same skill domain, to recommend jobs to individuals based on their skills and to mitigate occupational misclassification issues. While we provide initial results and descriptions of occupational groups in the Burning Glass data, we believe that the main contribution of this work is the methodology for grouping jobs into occupations based on skills.
    Keywords: labour demand, occupational classification, online job adverts, big data, machine learning, word embeddings
    JEL: C18 J23 J24
    Date: 2018–03
  9. By: Paul Adeoye Omosebi (Centre for Econometric and Allied Research, University of Ibadan. Department of Computer Sciences, University of Lagos.); Adetunji Philip Adewole (Department of Computer Sciences, University of Lagos.)
    Abstract: Big Data is a source of innovation that has captured the attention of citizens and decision makers in both the public and private sectors. Making use of technological innovations in big data could contribute to economic growth and sustainable development and help capture the explosive growth of big data. For some time now, the world has stepped up its focus on evidence-based policy making and monitoring of development progress; hence the measurement and analysis of diverse sources of data, combined with advanced analytics, promise to create value for decision makers and society, and hence for economic growth and development. There are 17 Sustainable Development Goals (SDGs), 169 SDG targets and 230 SDG indicators. The 17 Sustainable Development Goals and 169 targets demonstrate the scale and ambition of this new universal Agenda of countries to collect and maintain relevant standardized data such that it will support domestic technology development, research and innovation in developing countries. This paper highlights the new technological innovations in big data and cloud computing which can lead to economic growth and sustainable development. Also, we present a comprehensive survey of the Big Data challenges, Big Data technology challenges, cloud computing and relevant technology landscape like Internet of Things (IoT) towards economic growth and technological innovation.
    Keywords: big data, cloud computing, technology innovation, sustainable development, internet of things
    JEL: M15 O32 O40 Q01
    Date: 2018–03
  10. By: YIN Deyun; MOTOHASHI Kazuyuki
    Abstract: This paper presents the first systematic disambiguation result of all Chinese patent inventors in the State Intellectual Property Office of China (SIPO) patent database from 1985 to 2016. We provide a method of constructing high-quality training data from lists of rare names and evidence for the reliability of these generated labels when large-scale and representative hand-labeled data are crucial but expensive, prone to error, and even impossible to obtain. We then compare the performances of seven supervised models, i.e., naive Bayes, logistic, linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA), as well as tree-based methods (random forest, AdaBoost, and gradient boosting decision trees), and find that the gradient boosting classifier outperforms all other classifiers with the highest F1-score and stable performance in solving the homonym problem prevailing in Chinese names. In the last step, instead of adopting the more popular hierarchical clustering method, we cluster records with density-based spatial clustering of applications with noise (DBSCAN) based on the distance matrix predicted by the GBDT classifier. Varying across different testing data and parameters of DBSCAN, our algorithm yields an F1-score ranging from 93.5% to 99.3%, with splitting error within the range 0.5% to 3% and lumping error between 0.056% and 0.37%. Based on our disambiguated result, we provide an overview of Chinese inventors' regional mobility.
    Date: 2018–03
  11. By: Halford, Susan; Savage, Mike
    Abstract: Recent years have seen persistent tension between proponents of big data analytics, using new forms of digital data to make computational and statistical claims about ‘the social’, and many sociologists sceptical about the value of big data, its associated methods and claims to knowledge. We seek to move beyond this, taking inspiration from a mode of argumentation pursued by Piketty, Putnam and Wilkinson and Pickett that we label ‘symphonic social science’. This bears both striking similarities and significant differences to the big data paradigm and – as such – offers the potential to do big data analytics differently. This offers value to those already working with big data – for whom the difficulties of making useful and sustainable claims about the social are increasingly apparent – and to sociologists, offering a mode of practice that might shape big data analytics for the future.
    Keywords: big data; computational methods; sociology; symphonic social science; visualisation
    JEL: C1
    Date: 2017–12–01
  12. By: Jakub Growiec
    Abstract: I propose a synthetic theory of economic growth and technological progress over the entire human history. Based on this theory as well as on the analogies with three previous eras (the hunter-gatherer era, the agricultural era and the industrial era) and the technological revolutions which initiated them, I draw conclusions for the contemporary digital era. I argue that each opening of a new era adds a new, previously inactive dimension of economic development, and redefines the key inputs and output of the production process. Economic growth accelerates across the consecutive eras, but there are also big shifts in factor shares and inequality. The two key inputs to the digital-era production process are hardware and software. Human skilled labor is complementary to hardware and substitutable with software, which increasingly includes sophisticated artificial intelligence (AI) technologies. I also argue that economists have not yet designed sufficient measurement tools, economic policies and institutions appropriate for the digital-era economy.
    Keywords: economic growth, technological progress, unified growth theory, digital economy, artificial intelligence.
    JEL: O10 O30 O40
    Date: 2018–04
  13. By: Arindrajit Dube; Jeff Jacobs; Suresh Naidu; Siddharth Suri
    Abstract: On-demand labor platforms make up a large part of the “gig economy.” We quantify the extent of monopsony power in one of the largest on-demand labor platforms, Amazon Mechanical Turk (MTurk), by measuring the elasticity of labor supply facing the requester (employer) using both observational and experimental variation in wages. We isolate plausibly exogenous variation in rewards using a double-machine-learning estimator applied to a large dataset of scraped MTurk tasks. We also re-analyze data from 5 MTurk experiments that randomized payments to obtain corresponding experimental estimates. Both approaches yield uniformly low labor supply elasticities, around 0.1, with little heterogeneity.
    JEL: J01 J42
    Date: 2018–03
  14. By: Renda, Andrea
    Abstract: Many experts argue that focusing on how automated cars will solve the dilemma known as the ‘trolley problem’ isn’t going to get us very far in the debate about the ethics of artificial intelligence (AI). But it’s hard to resist if you are a philosopher, an ethicist, a futurist, or simply a geek – and it’s fun. Still, this dilemma can reveal a number of outstanding policy issues that are often neglected in the public debate. This paper performs a ‘crime scene investigation’ to find some of the missing parts in the ethics/AI quandary. These include the need to preserve human control over machines; the need to take data governance and ownership seriously; algorithmic accountability and transparency; various forms of user empowerment and their tension in relation to overall system control; the need for modernised tort rules; and more generally, a discussion about whether algorithms should reflect, exacerbate or mitigate the biases existing in our society. The investigation concludes that current legal systems are insufficiently equipped to cope with most of these issues, and that a mapping of outstanding ethical and policy dilemmas is a useful starting point for a thorough overhaul of public policies in this complex and ever-expanding domain.
    Date: 2018–01
  15. By: Gábor Petneházi; József Gáll
    Abstract: We investigate the predictability of several range-based stock volatility estimators, and compare them to the standard close-to-close estimator, which is most commonly acknowledged as the volatility. The patterns of volatility changes are analyzed using LSTM recurrent neural networks, which are a state-of-the-art method of sequence learning. We implement the analysis on all current constituents of the Dow Jones Industrial Average index, and report averaged evaluation results. We find that changes in the values of range-based estimators are more predictable than those of the estimator using daily closing values only.
    Date: 2018–03
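To make the distinction in the abstract above concrete, here is a minimal sketch contrasting a range-based estimator with the close-to-close estimator. The abstract does not name the range-based estimators it studies, so the Parkinson high-low estimator is used as one well-known example and is an assumption. The price series is hypothetical, and the LSTM forecasting step is not shown.

```python
import math


def parkinson_variance(highs, lows):
    """Parkinson range-based variance: mean of ln(H/L)^2 divided by 4 ln 2."""
    n = len(highs)
    total = sum(math.log(h / l) ** 2 for h, l in zip(highs, lows))
    return total / (4.0 * n * math.log(2.0))


def close_to_close_variance(closes):
    """Sample variance of daily log returns computed from closing prices only."""
    rets = [math.log(b / a) for a, b in zip(closes, closes[1:])]
    mean = sum(rets) / len(rets)
    return sum((r - mean) ** 2 for r in rets) / (len(rets) - 1)


# Hypothetical daily highs, lows and closes for one stock (illustrative only).
highs = [101.5, 102.3, 101.9, 103.0, 102.4]
lows = [99.8, 100.9, 100.2, 101.1, 100.7]
closes = [100.9, 101.8, 100.6, 102.5, 101.2]

print("Parkinson daily volatility:     ", math.sqrt(parkinson_variance(highs, lows)))
print("Close-to-close daily volatility:", math.sqrt(close_to_close_variance(closes)))
```

The range-based estimator uses the intraday high and low of every day, whereas the close-to-close estimator sees only one price per day.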
  16. By: Arjun R; Suprabha KR
    Abstract: The study aims to explore the strength of the causal relationship between stock price search interest and real stock market outcomes on worldwide equity market indices. Such a phenomenon could also be mediated by investor behavior and the extent of news coverage. The stock-specific internet search trends data and corresponding index close values from different countries' stock exchanges are collected and analyzed. Empirical findings show that global stock price search interest correlates more with developing economies, with fewer effects in South Asian stock exchanges, apart from strong influence in western countries. Finally, this study calls for development in expert decision support systems with the synthesis of using big data sources on forecasting market outcomes.
    Date: 2018–04
  17. By: Vegard Høghaug Larsen; Leif Anders Thorsrud
    Abstract: This article quantifies the epidemiology of media narratives relevant to business cycles in the US, Japan, and Europe (euro area). We do so by first constructing daily business cycle indexes computed on the basis of the news topics the media writes about. At a broad level, the most influential news narratives are shown to be associated with general macroeconomic developments, finance, and (geo-)politics. However, a large set of narratives contributes to our index estimates across time, especially in times of expansion. In times of trouble, narratives associated with economic fluctuations become more sparse. Likewise, we show that narratives do go viral, but mostly so when growth is low. While narratives interact in complicated ways, we document that some are clearly associated with economic fundamentals. Other narratives, on the other hand, show no such relationship, and are likely better explained by classical work capturing the market's animal spirits.
    Keywords: Business cycles, Narratives, Dynamic Factor Model (DFM), Latent Dirichlet Allocation (LDA)
    Date: 2018–04
  18. By: Gábor-Tóth, Enikő; Georgarakos, Dimitris
    Abstract: Does economic policy uncertainty affect household stockholding? To answer this question we create a novel measure of household exposure to economic policy uncertainty news by combining survey information on the hours a household spends in reading newspapers and the frequency of such news in the popular press during a household's pre-interview period. After controlling for household fixed effects, month-year fixed effects and time-varying cognitive skills, we find that households with a higher exposure to economic policy uncertainty news are less likely to invest in stocks held directly or through mutual funds. This effect is independent from the market volatility index and household (first-moment) expectations about the stock market index.
    Keywords: economic policy uncertainty, household finance, stockholding, text analysis
    JEL: D14 D81 G11
    Date: 2018

This nep-big issue is ©2018 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at For comments please write to the director of NEP, Marco Novarese at <>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.