nep-big New Economics Papers
on Big Data
Issue of 2019‒11‒04
seventeen papers chosen by
Tom Coupé
University of Canterbury

  1. Behind the headline number: Why not to rely on Frey and Osborne’s predictions of potential job loss from automation By Michael Coelli; Jeff Borland
  2. Popular Music, Sentiment, and Noise Trading By Kim Kaivanto; Peng Zhang
  3. I like, therefore I am. Predictive modeling to gain insights in political preference in a multi-party system By PRAET, Stiene; VAN AELST, Peter; MARTENS, David
  4. Keeping Real World Bias Out of Artificial Intelligence: "Examination of Coder Bias in Data Science Recruitment Solutions" By Yvette Burton
  5. Towards Legal Empirical Macrodynamics: A Research Agenda By Julia M. Puaschunder
  6. Boosting By Tae-Hwy Lee; Jianghao Chu; Aman Ullah; Ran Wang
  7. What is the Value Added by using Causal Machine Learning Methods in a Welfare Experiment Evaluation? By Strittmatter, Anthony
  8. Component-wise AdaBoost Algorithms for High-dimensional Binary Classification and Class Probability Prediction By Tae-Hwy Lee; Jianghao Chu; Aman Ullah
  9. Deep reinforcement learning for market making in corporate bonds: beating the curse of dimensionality By Olivier Guéant; Iuliia Manziuk
  10. Variable Selection in Sparse Semiparametric Single Index Models By Tae-Hwy Lee; Jianghao Chu; Aman Ullah
  11. Predicting Monetary Policy Using Artificial Neural Networks By Hinterlang, Natascha
  12. A Classifiers Voting Model for Exit Prediction of Privately Held Companies By Giuseppe Carlo Calafiore; Marisa Hillary Morales; Vittorio Tiozzo; Serge Marquie
  13. Deep convolutional autoencoder for cryptocurrency market analysis By Vladimir Puzyrev
  14. Digitalisation of Agriculture and Rural Areas in Bulgaria (Дигитализация на селското стопанство и райони в България) By Bachev, Hrabrin
  15. Narrative monetary policy surprises and the media By Saskia ter Ellen; Vegard H. Larsen; Leif Anders Thorsrud
  16. Bootstrap Aggregating and Random Forest By Tae-Hwy Lee; Aman Ullah; Ran Wang
  17. Global Rulemaking Strategy for Implementing Emerging Innovation: Case of Medical/Healthcare Robot, HAL by Cyberdyne (Japanese) By IKEDA Yoko; IIZUKA Michiko

  1. By: Michael Coelli (Department of Economics, The University of Melbourne); Jeff Borland (Department of Economics, The University of Melbourne)
    Abstract: We review a highly influential study that estimated potential job loss from advances in Artificial Intelligence and robotics: Frey and Osborne (FO) (2013, 2017) concluded that 47 per cent of jobs in the United States were at ‘high risk’ of automation in the next 10 to 20 years. First, we investigate FO’s methodology for estimating job loss. Several major problems and limitations are revealed, especially those associated with the subjective designation of occupations as fully automatable. Second, we examine whether FO’s predictions can explain occupation-level changes in employment in the United States from 2013 to 2018. Compared to standard approaches which classify jobs based on their intensity in routine tasks, FO’s predictions do not ‘add value’ for forecasting the impact of technology on employment.
    Keywords: employment; technology; prediction; job loss; AI and robotics
    JEL: J21 O33
    Date: 2019–10
  2. By: Kim Kaivanto; Peng Zhang
    Abstract: We construct a sentiment indicator as the first principal component of thirteen emotion metrics derived from the lyrics and composition of music-chart singles. This indicator performs well, dominating the Michigan Index of Consumer Sentiment and bettering the Baker-Wurgler index in long-horizon regression tests as well as in out-of-sample forecasting tests. The music-sentiment indicator captures both signal and noise. The part associated with fundamentals predicts more distant market returns positively. The second part is orthogonal to fundamentals, and predicts one-month-ahead market returns negatively. This is evidence of noise trading explained by the emotive content of popular music.
    Keywords: investor sentiment, stock-return predictability, big data, textual analysis, natural language processing, popular music, noise trading, behavioural finance
    JEL: G12 G17 C55
    Date: 2019
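    Editorial illustration for item 2: the abstract describes the indicator as the first principal component of thirteen emotion metrics. The sketch below shows only that generic step, using power iteration on the correlation matrix of standardized series. The data are synthetic, and the function names, sample length and noise level are assumptions made for illustration; none of it is the authors' code or data.

```python
import math
import random


def standardize(columns):
    """Z-score each series (each column is a list of observations)."""
    out = []
    for col in columns:
        mean = sum(col) / len(col)
        sd = math.sqrt(sum((x - mean) ** 2 for x in col) / (len(col) - 1))
        out.append([(x - mean) / sd for x in col])
    return out


def first_principal_component(columns, iters=200):
    """Return (loadings, scores) of the first PC via power iteration."""
    z = standardize(columns)
    k, n = len(z), len(z[0])
    # Sample correlation matrix of the standardized metrics.
    corr = [[sum(a * b for a, b in zip(z[i], z[j])) / (n - 1)
             for j in range(k)] for i in range(k)]
    w = [1.0 / math.sqrt(k)] * k  # equal-weight starting vector
    for _ in range(iters):
        v = [sum(corr[i][j] * w[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(x * x for x in v))
        w = [x / norm for x in v]
    scores = [sum(w[i] * z[i][t] for i in range(k)) for t in range(n)]
    return w, scores


def pearson(a, b):
    """Pearson correlation between two equal-length series."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


# Synthetic stand-in: 13 noisy metrics driven by one hidden "sentiment" factor.
rng = random.Random(0)
factor = [rng.gauss(0, 1) for _ in range(200)]
metrics = [[f + 0.5 * rng.gauss(0, 1) for f in factor] for _ in range(13)]

loadings, indicator = first_principal_component(metrics)
print(round(pearson(indicator, factor), 3))
```

    In this toy setting the first component should track the hidden factor closely, because all thirteen series load on it; with real data the interpretation of the component would need to be checked separately.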
  3. By: PRAET, Stiene; VAN AELST, Peter; MARTENS, David
    Abstract: In political science there is a long tradition of trying to understand party preferences and voting behavior to explain political decisions. Traditionally, scholars relied on voting histories, religious affiliation, and socio-economic status to understand people’s vote. Today, thanks to the Internet and social media, an unprecedented amount and granularity of data is available. In this paper we show how political insights can be gained from high-dimensional and sparse Facebook data, by building and interpreting predictive models based on Facebook ‘like’ and survey data of more than 6,500 Flemish participants. First, we built several logistic regression models to show that it is possible to predict political leaning and party preference based on Facebook likes in a multi-party system, even when excluding the political Facebook likes. Secondly, by introducing several metrics that measure the association between Facebook likes and a certain political affiliation, we can describe voter profiles in terms of common interests. For example, left voters often like environmental organizations and alternative rock music, whereas right voters like Flemish nationalistic content and techno music. Lastly, we develop a method to measure ideological homogeneity, that is, the extent to which people who like the same products, movies, books, etc. have a similar political ideology. In the Flemish setting, the categories ‘politics’ and ‘civil society’ are most ideologically homogeneous whereas ‘TV shows’ and ‘sports’ are the most heterogeneous. The results show that our approach has the potential to help political scientists to gain insights into voter profiles and ideological homogeneity using Facebook likes.
    Date: 2018–12
  4. By: Yvette Burton (Columbia University School of Professional Studies)
    Abstract: Research Question and Objectives: Is there subtle gender bias in the way companies word and code job listings in such fields as engineering and programming? Although the Civil Rights Act effectively bans companies from explicitly requesting workers of a particular gender, the language in these listings may discourage many women from applying. The objectives of the research are to create two foundational constructs leaders can use to address the growing employee competency and business performance gaps created by the lack of gender diversity among data scientist roles, and by silos across enterprise talent strategies. These two objectives include: Integrated Data Scientist and HCM Leadership Development Strategies and AI Leadership Assessment and Development w/ Risk Audits.
    Keywords: Coding Bias, Artificial Intelligence, Data Scientists, Leadership Development, Business Performance, Digital Workforce Solutions, Behavioral Analytics, Twenty-First Century Skills Gaps, Human Capital Management, STEM, Enterprise Risk Management.
    JEL: C89 D81 J24
    Date: 2019–07
  5. By: Julia M. Puaschunder (The New School, Department of Economics)
    Abstract: Legal scholarship has existed since the beginnings of science. Attempts to quantify the economic consequences of legal codifications have been made in the vibrant interdisciplinary field of law and economics. The socio-economic calculus of public policies has been addressed in new public management. Behavioral economics has entered the legal scientific discourse in the emerging field of empirical legal studies, backed by ample evidence of the effect of law on socio-dynamics retrieved from manifold field and laboratory experiments. Behavioral insights are the most recent Nobel Prize-crowned development for understanding human decision making in the legal and public fields, helping civil servants and legal executives foster the socio-economic outcomes of their work. All these interdisciplinary approaches aim to shed light on the socio-economic outcomes of legal codifications in order to improve public collectives. In all these cases, empirics derived from quantitative and qualitative research help gain inferences for legal theory building and the strengthening of public policy implementations. This article argues that the time is ripe to dare the next step in legal empirical analyses by drawing on insights retrieved from big data and algorithmic machine learning, and by introducing the use of optimal control macrodynamic modelling, a methodology originating in physics that entered macroeconomics and related disciplines to quantify and optimally control economic theory and practice. Given the ongoing big data revolution and exponentially rising data transfer coupled with unprecedented computational power advancements, the means are now available, for the first time, to push for empirical legal studies embracing novel tools (such as hierarchical modelling, Bayesian statistics and optimum control sophistications) to derive inferences on how to improve legal theory and practices in innovative ways as never before possible.
On the brink of artificial intelligence entering the labor force at a large scale, legal scholarship can now adapt to the novel market opportunities by acknowledging unprecedented computational power and methodological sophistication in deriving insights from big data. Heralding a new age of legal empirical macrodynamics also serves the legal community in light of the predicted heightening demands for creativity as a valuable future asset of human legal practitioners and scholars, in comparison to repetitive tasks likely soon being outsourced to AI and machine learning.
    Keywords: Artificial Intelligence, Behavioral Economics, Behavioral Political Economy, Big Data, Governance, Machine Learning, New Public Management, Legal Scholarship, Optimal Control, Social Credit Score
    Date: 2019–08
  6. By: Tae-Hwy Lee (Department of Economics, University of California Riverside); Jianghao Chu (University of California, Riverside); Aman Ullah (University of California, Riverside); Ran Wang (University of California, Riverside)
    Abstract: In the era of Big Data, selecting relevant variables from a potentially large pool of candidate variables has become a newly emerging concern in macroeconomic research, especially when the data available are high-dimensional, i.e. the number of explanatory variables (p) is greater than the sample size (n). Common approaches include factor models, principal component analysis and regularized regressions. However, these methods require additional assumptions that are hard to verify and/or introduce biases or aggregated factors which complicate the interpretation of the estimated outputs. This chapter reviews an alternative solution, namely Boosting, which is able to estimate the variables of interest consistently under fairly general conditions given a large set of explanatory variables. Boosting is fast and easy to implement, which makes it one of the most popular machine learning algorithms in academia and industry.
    Keywords: Boosting, AdaBoost, Gradient Boosting, Functional Gradient Descent, Decision Tree, Shrinkage
    JEL: C2 C3 C4 C5
    Date: 2019–05
  7. By: Strittmatter, Anthony
    JEL: H75 I38 J22 J31 C21
    Date: 2019
  8. By: Tae-Hwy Lee (Department of Economics, University of California Riverside); Jianghao Chu (UCR); Aman Ullah (UCR)
    Abstract: Freund and Schapire (1997) introduced "Discrete AdaBoost" (DAB) which has been mysteriously effective for the high-dimensional binary classification or binary prediction. In an effort to understand the myth, Friedman, Hastie and Tibshirani (FHT, 2000) show that DAB can be understood as statistical learning which builds an additive logistic regression model via Newton-like updating minimization of the exponential loss. From this statistical point of view, FHT proposed three modifications of DAB, namely, Real AdaBoost (RAB), LogitBoost (LB), and Gentle AdaBoost (GAB). All of DAB, RAB, LB, GAB solve for the logistic regression via different algorithmic designs and different objective functions. The RAB algorithm uses class probability estimates to construct real-valued contributions of the weak learner, LB is an adaptive Newton algorithm by stagewise optimization of the Bernoulli likelihood, and GAB is an adaptive Newton algorithm via stagewise optimization of the exponential loss. The same authors of FHT published an influential textbook, The Elements of Statistical Learning (ESL, 2001 and 2008). A companion book An Introduction to Statistical Learning (ISL) by James et al. (2013) was published with applications in R. However, both ESL and ISL (e.g., sections 4.5 and 4.6) do not cover these four AdaBoost algorithms while FHT provided some simulation and empirical studies to compare these methods. Given numerous potential applications, we believe it would be useful to collect the R libraries of these AdaBoost algorithms, as well as more recently developed extensions to AdaBoost for probability prediction with examples and illustrations.
Therefore, the goal of this chapter is to do just that, i.e., (i) to provide a user guide of these alternative AdaBoost algorithms with step-by-step tutorial of using R (in a way similar to ISL, e.g., Section 4.6), (ii) to compare AdaBoost with alternative machine learning classification tools such as the deep neural network (DNN), logistic regression with LASSO and SIM-RODEO, and (iii) to demonstrate the empirical applications in economics, such as prediction of business cycle turning points and directional prediction of stock price indexes. We revisit Ng (2014) who used DAB for prediction of the business cycle turning points by comparing the results from RAB, LB, GAB, DNN, logistic regression and SIM-RODEO.
    Keywords: AdaBoost, R, Binary classification, Logistic regression, DAB, RAB, LB, GAB, DNN
    Date: 2018–07
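    Editorial illustration for item 8: a minimal sketch of Discrete AdaBoost with decision stumps as weak learners, written in the common exponential-loss form (learner weight 0.5·log((1−err)/err), observation weights rescaled by exp(−alpha·y·h)). The chapter itself works in R; this Python version, its function names and its six-point toy data set are assumptions for illustration, not the authors' implementation.

```python
import math


def fit_stump(X, y, w):
    """Weighted decision stump with lowest error: (error, feature, threshold, polarity)."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted(set(row[j] for row in X)):
            for pol in (1, -1):
                pred = [pol if row[j] <= thr else -pol for row in X]
                err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, pol)
    return best


def stump_predict(stump, row):
    """Label in {-1, +1} assigned by a fitted stump."""
    _, j, thr, pol = stump
    return pol if row[j] <= thr else -pol


def discrete_adaboost(X, y, rounds):
    """Fit `rounds` stumps, reweighting observations after each one."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        stump = fit_stump(X, y, w)
        err = min(max(stump[0], 1e-12), 1 - 1e-12)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Misclassified points gain weight, correctly classified points lose it.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for wi, yi, row in zip(w, y, X)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict(ensemble, row):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * stump_predict(s, row) for alpha, s in ensemble)
    return 1 if score >= 0 else -1


# Toy data: the -1 labels sit in the middle, so no single stump can separate them.
X = [[0], [1], [2], [3], [4], [5]]
y = [1, 1, -1, -1, 1, 1]

one_round = [predict(discrete_adaboost(X, y, 1), row) for row in X]
three_rounds = [predict(discrete_adaboost(X, y, 3), row) for row in X]
print(one_round, three_rounds)
```

    On this toy set a single stump necessarily misclassifies some points, while the weighted vote of three stumps reproduces the labels, which is the basic mechanism the abstract refers to.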
  9. By: Olivier Guéant; Iuliia Manziuk
    Abstract: In corporate bond markets, which are mainly OTC markets, market makers play a central role by providing bid and ask prices for a large number of bonds to asset managers from all around the globe. Determining the optimal bid and ask quotes that a market maker should set for a given universe of bonds is a complex task. Useful models exist, most of them inspired by that of Avellaneda and Stoikov. These models describe the complex optimization problem faced by market makers: proposing bid and ask prices in an optimal way for making money out of the difference between bid and ask prices while mitigating the market risk associated with holding inventory. While most of the models only tackle one-asset market making, they can often be generalized to a multi-asset framework. However, the problem of numerically solving the equations characterizing the optimal bid and ask quotes is seldom tackled in the literature, especially in high dimension. In this paper, our goal is to propose a numerical method for approximating the optimal bid and ask quotes over a large universe of bonds in a model à la Avellaneda-Stoikov. Because we aim at considering a large universe of bonds, classical finite difference methods such as those discussed in the literature cannot be used, and we therefore present a discrete-time method inspired by reinforcement learning techniques. More precisely, the approach we propose is a model-based actor-critic-like algorithm involving deep neural networks.
    Date: 2019–10
  10. By: Tae-Hwy Lee (Department of Economics, University of California Riverside); Jianghao Chu (UCR); Aman Ullah (UCR)
    Abstract: In this paper we consider the "Regularization of Derivative Expectation Operator" (Rodeo) of Lafferty and Wasserman (2008) and propose a modified Rodeo algorithm for semiparametric single index models in a big data environment with many regressors. The method assumes sparsity, i.e., that many of the regressors are irrelevant. It uses a greedy algorithm: to estimate the semiparametric single index model (SIM) of Ichimura (1993), all coefficients of the regressors are initially set to start from near zero, then we test iteratively if the derivative of the regression function estimator with respect to each coefficient is significantly different from zero. The basic idea of the modified Rodeo algorithm for SIM (to be called SIM-Rodeo) is to view the local bandwidth selection as a variable selection scheme which amplifies the coefficients for relevant variables while keeping the coefficients of irrelevant variables relatively small or at the initial starting values near zero. For sparse semiparametric single index models, the SIM-Rodeo algorithm is shown to attain consistency in variable selection. In addition, the algorithm is fast to finish the greedy steps. We compare SIM-Rodeo with the SIM-Lasso method in Zeng et al. (2012). Our simulation results demonstrate that the proposed SIM-Rodeo method is consistent for variable selection and show that it has smaller integrated mean squared errors than SIM-Lasso.
    Keywords: Single index model (SIM), Variable selection, Rodeo, SIM-Rodeo, Lasso, SIM-Lasso.
    JEL: C25 C44 C53 C55
    Date: 2018–09
  11. By: Hinterlang, Natascha
    JEL: C45 C53 E47
    Date: 2019
  12. By: Giuseppe Carlo Calafiore; Marisa Hillary Morales; Vittorio Tiozzo; Serge Marquie
    Abstract: Predicting the exit (e.g. bankruptcy, acquisition, etc.) of privately held companies is a current and relevant problem for investment firms. The difficulty of the problem stems from the lack of reliable, quantitative and publicly available data. In this paper, we contribute to this endeavour by constructing an exit predictor model based on qualitative data, which blends the outcomes of three classifiers, namely, a Logistic Regression model, a Random Forest model, and a Support Vector Machine model. The output of the combined model is selected on the basis of the majority of the output classes of the component models. The models are trained using data extracted from the Thomson Reuters Eikon repository of 54697 US and European companies over the 1996-2011 time span. Experiments have been conducted for predicting whether the company eventually either gets acquired or goes public (IPO), against the complementary event that it remains private or goes bankrupt, in the considered time window. Our model achieves a 63% predictive accuracy, which is quite a valuable figure for Private Equity investors, who typically expect very high returns from successful investments.
    Date: 2019–10
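    Editorial illustration for item 12: the abstract says the combined output is the majority class among three component classifiers. The sketch below shows only that voting step. The three label lists are made-up stand-ins for the outputs of the logistic regression, random forest and support vector machine models (1 = acquired or IPO, 0 = stays private or goes bankrupt); the function name is hypothetical.

```python
from collections import Counter


def majority_vote(*model_outputs):
    """Combine per-model label lists by taking the most common label per company."""
    combined = []
    for labels in zip(*model_outputs):
        combined.append(Counter(labels).most_common(1)[0][0])
    return combined


# Hypothetical predictions for four companies from three component models.
logit_out = [1, 0, 1, 0]
forest_out = [1, 1, 0, 0]
svm_out = [0, 1, 1, 0]

combined = majority_vote(logit_out, forest_out, svm_out)
print(combined)  # [1, 1, 1, 0]
```

    With three models and two classes a tie cannot occur, which is one practical reason to combine an odd number of binary classifiers.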
  13. By: Vladimir Puzyrev
    Abstract: This study attempts to analyze patterns in cryptocurrency markets using a special type of deep neural network, namely a convolutional autoencoder. The method extracts the dominant features of market behavior and classifies the 40 studied cryptocurrencies into several classes for twelve 6-month periods starting from 15th May 2013. Transitions from one class to another with time are related to the maturation of cryptocurrencies. In speculative cryptocurrency markets, these findings have potential implications for investment and trading strategies.
    Date: 2019–10
  14. By: Bachev, Hrabrin
    Abstract: Despite its great theoretical and practical importance, in Bulgaria there is no comprehensive analysis of the state and evolution of digitalisation in agriculture and rural areas. The goal of this study is to analyse the state and development of digitalisation in the country and in the agrarian sphere in Bulgaria, reveal major trends in that area, compare the situation with other EU countries, identify the main problems, and make recommendations for improving policies in the next programming period. The analysis finds that in recent years there has been a considerable improvement in the access of Bulgarian households to the internet, as well as a significant increase in the number of persons using the internet for relations with public institutions and for trading goods and services. Nevertheless, Bulgaria is well behind other EU members with regard to the introduction of digital technologies in the economy and society, taking one of the last places in the EU in terms of the Integral Index for Introduction of Digital Technologies in the Economy and Society (DESI). There is great variation in the extent of digitalisation across different subsectors of agriculture, farms of different juridical type and size, and different regions of the country. Most agricultural holdings are not aware of the content of digital agriculture, and 14% apply modern digital technologies. Major obstacles to the introduction of digital technologies are the qualification of employees, the amount of required investment, unclear economic benefits, and data security. The main areas where state administration actions are required are: support of measures for supplementary training of labour, tax preferences in planning of actions and digitalisation of activity, stimulation of young specialists, introduction of internationally recognized processes of standardisation and certification, adaptation of legislation in the area of data protection, and securing reliable and high-speed networks.
    Keywords: digitalisation, agriculture, rural, Bulgaria
    JEL: O3 O32 O33 O35 Q01 Q1 Q18
    Date: 2019–10–19
  15. By: Saskia ter Ellen; Vegard H. Larsen; Leif Anders Thorsrud
    Abstract: We propose a method to quantify narratives from textual data in a structured manner, and identify what we label "narrative monetary policy surprises" as the change in economic media coverage explained by central bank communication accompanying interest rate meetings. Our proposed method is fast and simple, and relies on a Singular Value Decomposition of the different texts and articles coupled with a unit rotation identification scheme. Identifying narrative surprises in central bank communication using this type of data and identification provides surprise measures that are uncorrelated with conventional monetary policy surprises, and, in contrast to such surprises, have a significant effect on subsequent media coverage. In turn, narrative monetary policy surprises lead to macroeconomic responses similar to what recent monetary policy literature associates with the information component of monetary policy communication. Our study highlights the importance of written central bank communication and the role of the media as information intermediaries.
    Keywords: communication, monetary policy, factor identification, textual data
  16. By: Tae-Hwy Lee (Department of Economics, University of California Riverside); Aman Ullah (University of California, Riverside); Ran Wang (University of California, Riverside)
    Abstract: Bootstrap Aggregating (Bagging) is an ensemble technique for improving the robustness of forecasts. Random Forest is a successful method based on Bagging and Decision Trees. In this chapter, we explore Bagging, Random Forest, and their variants in various aspects of theory and practice. We also discuss applications based on these methods in economic forecasting and inference.
    Keywords: bagging, decision trees, random forests, forecasting
    JEL: C2 C3 C4 C5
    Date: 2019–07
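    Editorial illustration for item 16: a minimal sketch of Bootstrap Aggregating, assuming a one-split regression tree ("stump") as the base learner and a synthetic step-shaped series. Each stump is fitted on a bootstrap resample and the bagged forecast is the average of the individual forecasts. The function names, the data and the number of resamples are assumptions for illustration and are not taken from the chapter.

```python
import random
import statistics


def fit_stump(xs, ys):
    """One-split regression tree: returns (threshold, left_mean, right_mean)."""
    best = None
    for thr in sorted(set(xs))[:-1]:  # dropping the largest x keeps both sides non-empty
        left = [y for x, y in zip(xs, ys) if x <= thr]
        right = [y for x, y in zip(xs, ys) if x > thr]
        lm, rm = statistics.fmean(left), statistics.fmean(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, thr, lm, rm)
    if best is None:  # resample contained a single distinct x value
        m = statistics.fmean(ys)
        return (xs[0], m, m)
    return best[1:]


def stump_forecast(stump, x):
    """Forecast of a fitted stump at point x."""
    thr, lm, rm = stump
    return lm if x <= thr else rm


def bagged_forecast(xs, ys, x_new, n_boot=200, seed=0):
    """Average the forecasts of stumps fitted on bootstrap resamples."""
    rng = random.Random(seed)
    n = len(xs)
    forecasts = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # draw n indices with replacement
        stump = fit_stump([xs[i] for i in idx], [ys[i] for i in idx])
        forecasts.append(stump_forecast(stump, x_new))
    return statistics.fmean(forecasts), forecasts


# Synthetic data: level 0 for x < 10, level 1 afterwards, plus small noise.
data_rng = random.Random(1)
xs = list(range(20))
ys = [(0.0 if x < 10 else 1.0) + data_rng.gauss(0, 0.1) for x in xs]

low, low_all = bagged_forecast(xs, ys, 3)
high, high_all = bagged_forecast(xs, ys, 15)
print(round(low, 2), round(high, 2))
```

    A Random Forest adds a second source of randomness, a random subset of candidate variables at each split, which this single-regressor sketch does not attempt to show.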
  17. By: IKEDA Yoko; IIZUKA Michiko
    Abstract: Under the Fourth Industrial Revolution, innovations using emerging technologies (artificial intelligence, robotics, the internet of things) are said to improve productivity and quality of life. On the other hand, the diffusion of such innovation can involve risks and uncertainties regarding safety. Generally, these risks are managed by government by means of regulation. However, regulation may pose challenges to firms seeking commercialization, because emerging innovations often fall under neither existing product categories nor corresponding regulations. This study answers (1) how products based on emerging technology can be commercialized, overcoming existing regulatory barriers on safety, and (2) what role is played by standards, one type of regulation, through an examination of the case of Cyberdyne, a successful medical/healthcare robotics company in Japan. Cyberdyne developed and commercialized the world's first product using Cybernics in a wearable medical/healthcare device. The case illustrates the increasing complexity of safety regulations and a new role for standards, which is to enable firms to implement emerging technologies. It concludes with an exploration of policy considerations from a regulatory perspective in dealing with emerging technologies.
    Date: 2019–10

This nep-big issue is ©2019 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at For comments please write to the director of NEP, Marco Novarese at <>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.