nep-big New Economics Papers
on Big Data
Issue of 2018‒11‒05
fourteen papers chosen by
Tom Coupé
University of Canterbury

  1. Racing With or Against the Machine? : Evidence from Europe By Terry Gregory; A.M. Salomons; Ulrich Zierahn
  2. Longitudinal Environmental Inequality and Environmental Gentrification: Who Gains From Cleaner Air? By John Voorheis
  3. Air Quality, Human Capital Formation and the Long-term Effects of Environmental Inequality at Birth By John Voorheis
  4. Predicting Match Outcomes in Football by an Ordered Forest Estimator By Goller, Daniel; Knaus, Michael C.; Lechner, Michael; Okasa, Gabriel
  5. Model Selection Techniques -- An Overview By Jie Ding; Vahid Tarokh; Yuhong Yang
  6. Improving Stock Movement Prediction with Adversarial Training By Fuli Feng; Huimin Chen; Xiangnan He; Ji Ding; Maosong Sun; Tat-Seng Chua
  7. Early Detection of Students at Risk – Predicting Student Dropouts Using Administrative Student Data and Machine Learning Methods By Johannes Berens; Kerstin Schneider; Simon Görtz; Simon Oster; Julian Burghoff
  8. Martingale Functional Control variates via Deep Learning By Marc Sabate Vidales; David Siska; Lukasz Szpruch
  9. Using Preference Vector Modeling to Polarity Shift for Improvement of Opinion Mining By Chihli Hung
  10. Using Deep Learning for price prediction by exploiting stationary limit order book features By Avraam Tsantekidis; Nikolaos Passalis; Anastasios Tefas; Juho Kanniainen; Moncef Gabbouj; Alexandros Iosifidis
  11. Framing Discrete Choice Model as Deep Neural Network with Utility Interpretation By Shenhao Wang; Jinhua Zhao
  12. CNNPred: CNN-based stock market prediction using several data sources By Ehsan Hoseinzade; Saman Haratizadeh
  13. Deciphering Monetary Policy Committee Minutes with Text Mining Approach: A Case of South Korea By Youngjoon Lee; Soohyon Kim; Ki Young Park
  14. The Model Selection Curse By Kfir Eliaz; Ran Spiegler

  1. By: Terry Gregory; A.M. Salomons; Ulrich Zierahn
    Abstract: A fast-growing literature shows that digital technologies are displacing labor from routine tasks, raising concerns that labor is racing against the machine. We develop a task-based framework to estimate the aggregate labor demand and employment effects of routine-replacing technological change (RRTC), along with the underlying mechanisms. We show that while RRTC has indeed had strong displacement effects in the European Union between 1999 and 2010, it has simultaneously created new jobs through increased product demand, outweighing displacement effects and resulting in net employment growth. However, we also show that this finding depends on the distribution of gains from technological progress.
    Keywords: Labor Demand, Employment, Routine-Replacing Technological Change, Tasks, Local Demand Spillovers, T
    Date: 2018–09
  2. By: John Voorheis
    Abstract: A vast empirical literature has convincingly shown that there is pervasive cross-sectional inequality in exposure to environmental hazards. However, less is known about how these inequalities have been evolving over time. I fill this gap by creating a new dataset, which combines satellite data on ground-level concentrations of fine particulate matter with linked administrative and survey data. This linked dataset allows me to measure individual pollution exposure for over 100 million individuals in each year between 2000 and 2014, a period that has seen substantial improvements in average air quality. This rich dataset can then be used to analyze longitudinal dimensions of environmental inequality by examining the distribution of changes in individual pollution exposure that underlie these aggregate improvements. I confirm previous findings that cross-sectional environmental inequality has been on the decline, but I argue that this may miss longitudinal patterns in exposure that are consistent with environmental gentrification. I find that advantaged individuals at the beginning of the sample experience larger pollution exposure reductions than do initially disadvantaged individuals.
    Date: 2017–05
  3. By: John Voorheis
    Abstract: A growing body of literature suggests that pollution exposure early in life can have substantial long term effects on an individual’s economic well-being as an adult; however, the mechanisms for these effects remain unclear. I contribute to this literature by examining the effect of pollution exposure on several intermediate determinants of adult wages using a unique linked dataset for a large sample of individuals from two cohorts: an older cohort born around 1970, and a younger cohort born around 1990. This dataset links responses to the American Community Survey to SSA administrative data, the universe of IRS Form 1040 tax returns, pollution concentration data from EPA air quality monitors and satellite remote sensing observations. In both OLS and IV specifications, I find that pollution exposure at birth has a large and economically significant effect on college attendance among 19-22 year olds. Using conventional estimates of the college wage premium, these effects imply that a 10 µg/m3 decrease in particulate matter exposure at birth is associated with a $190 per year increase in annual wages. This effect is smaller than the wage effects in the previous literature, which suggests that human capital acquisition associated with cognitive skills cannot fully explain the long term wage effects of pollution exposure. Indeed, I find evidence for an additional channel working through non-cognitive skills: pollution exposure at birth increases high school non-completion and incarceration among 16-24 year olds, and these effects are concentrated within disadvantaged communities, with larger effects for non-whites and children of poor parents. I also find that pollution exposure during adolescence has statistically significant effects on high school non-completion and incarceration, but no effect on college attendance. These results suggest that the long term effects of pollution exposure on economic well-being may run through multiple channels, in which both non-cognitive skills and cognitive skills may play a role.
    Date: 2017–05
  4. By: Goller, Daniel; Knaus, Michael C.; Lechner, Michael; Okasa, Gabriel
    Abstract: We predict the probabilities for a draw, a home win, and an away win, for the games of the German Football Bundesliga (BL1) with a new machine-learning estimator using the (large) information available up to that date. We use these individual predictions in order to simulate a league table for every game day until the end of the season. This combination of a (stochastic) simulation approach with machine learning allows us to come up with statements about the likelihood that a particular team reaches specific places in the final league table (i.e. champion, relegation, etc.). The machine-learning algorithm used builds on a recent development of an Ordered Random Forest. This estimator generalises common estimators like ordered probit or ordered logit maximum likelihood and is able to recover essentially the same output as the standard estimators, such as the probabilities of the alternative conditional on covariates. The approach is already in use and results for the current season can be found at
    Keywords: Prediction, Machine Learning, Random Forest, Soccer, Bundesliga
    JEL: C53
    Date: 2018–11
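    The simulation step described in this abstract can be illustrated with a minimal Monte Carlo sketch. It is not the authors' implementation: the function name `simulate_champion`, the input layout, and the random tie-break are assumptions made here for illustration, and the outcome probabilities are taken as given rather than produced by an Ordered Forest.

```python
import random
from collections import Counter


def simulate_champion(points, fixtures, probs, n_sims=10_000, seed=0):
    """Estimate each team's title probability by Monte Carlo simulation.

    points   -- dict: team -> points collected so far
    fixtures -- list of (home, away) tuples for the remaining games
    probs    -- dict: (home, away) -> (p_home_win, p_draw, p_away_win)
    """
    rng = random.Random(seed)
    titles = Counter()
    for _ in range(n_sims):
        table = dict(points)  # fresh copy of the current standings
        for home, away in fixtures:
            outcome = rng.choices(("H", "D", "A"), weights=probs[(home, away)])[0]
            if outcome == "H":
                table[home] += 3
            elif outcome == "A":
                table[away] += 3
            else:
                table[home] += 1
                table[away] += 1
        # Assumption: ties on points are broken at random; a real league
        # table would use goal difference and further criteria.
        best = max(table.values())
        leaders = [team for team, pts in table.items() if pts == best]
        titles[rng.choice(leaders)] += 1
    return {team: titles[team] / n_sims for team in points}
```

    Calling `simulate_champion({"A": 10, "B": 10}, [("A", "B")], {("A", "B"): (0.5, 0.3, 0.2)})` returns estimated title probabilities for the two hypothetical teams. The same loop could count relegation places instead of titles.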
  5. By: Jie Ding; Vahid Tarokh; Yuhong Yang
    Abstract: In the era of big data, analysts usually explore various statistical models or machine learning methods for observed data in order to facilitate scientific discoveries or gain predictive power. Whatever data and fitting procedures are employed, a crucial step is to select the most appropriate model or method from a set of candidates. Model selection is a key ingredient in data analysis for reliable and reproducible statistical inference or prediction, and thus central to scientific studies in fields such as ecology, economics, engineering, finance, political science, biology, and epidemiology. There has been a long history of model selection techniques that arise from research in statistics, information theory, and signal processing. A considerable number of methods have been proposed, following different philosophies and exhibiting varying performances. The purpose of this article is to provide a comprehensive overview of them, in terms of their motivation, large sample performance, and applicability. We provide integrated and practically relevant discussions on theoretical properties of state-of-the-art model selection approaches. We also share our thoughts on some controversial views on the practice of model selection.
    Date: 2018–10
  6. By: Fuli Feng; Huimin Chen; Xiangnan He; Ji Ding; Maosong Sun; Tat-Seng Chua
    Abstract: This paper contributes a new machine learning solution for stock movement prediction, which aims to predict whether the price of a stock will be up or down in the near future. The key novelty is that we propose to employ adversarial training to improve the generalization of a recurrent neural network model. The rationale for adversarial training here is that the input features to stock prediction are typically based on stock price, which is essentially a stochastic variable that changes continuously over time. As such, normal training with stationary price-based features (e.g. the closing price) can easily overfit the data, being insufficient to obtain reliable models. To address this problem, we propose to add perturbations to simulate the stochasticity of the continuous price variable, and train the model to work well under small yet intentional perturbations. Extensive experiments on two real-world stock datasets show that our method outperforms the state-of-the-art solution with 3.11% relative improvements on average w.r.t. accuracy, verifying the usefulness of adversarial training for the stock prediction task. Code will be made available upon acceptance.
    Date: 2018–10
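    The idea of training against "small yet intentional perturbations" can be sketched on a much simpler model than the recurrent network in the abstract. The example below is an assumption-laden illustration, not the paper's method: it uses a logistic classifier, perturbs the raw input with a sign-of-gradient step, and all names (`adversarial_example`, `log_loss`, `eps`) are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(x, w, b):
    """Probability of an 'up' move under a logistic model (illustrative)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def log_loss(x, y, w, b):
    """Binary cross-entropy for one example with label y in {0, 1}."""
    p = predict(x, w, b)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def adversarial_example(x, y, w, b, eps):
    """Shift each feature by eps in the direction that increases the loss.

    For logistic regression the gradient of the loss with respect to the
    input is (p - y) * w, so only its sign is needed here.
    """
    p = predict(x, w, b)
    grad = [(p - y) * wi for wi in w]
    return [
        xi + eps * math.copysign(1.0, g) if g != 0 else xi
        for xi, g in zip(x, grad)
    ]
```

    In adversarial training, the loss on `adversarial_example(x, y, w, b, eps)` would be added to the loss on the clean example `x` before each parameter update, so the model is pushed to classify both correctly.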
  7. By: Johannes Berens; Kerstin Schneider; Simon Görtz; Simon Oster; Julian Burghoff
    Abstract: To successfully reduce student attrition, it is imperative to understand what the underlying determinants of attrition are and which students are at risk of dropping out. We develop an early detection system (EDS) using administrative student data from a state and a private university to predict student success as a basis for a targeted intervention. The EDS uses regression analysis, neural networks, decision trees, and the AdaBoost algorithm to identify student characteristics which distinguish potential dropouts from graduates. Prediction accuracy at the end of the first semester is 79% for the state university and 85% for the private university of applied sciences. After the fourth semester, the accuracy improves to 90% for the state university and 95% for the private university of applied sciences.
    Keywords: student attrition, machine learning, administrative student data, AdaBoost
    JEL: I23 H42 C45
    Date: 2018
  8. By: Marc Sabate Vidales; David Siska; Lukasz Szpruch
    Abstract: We propose a black-box-type control variate for Monte Carlo simulations by leveraging the Martingale Representation Theorem and artificial neural networks. We develop several learning algorithms for finding martingale control variate functionals in both the Markovian and non-Markovian settings. The proposed algorithms guarantee convergence to the true solution independently of the quality of the deep learning approximation of the control variate functional. We believe that this is important as the current theory of deep learning function approximation lacks theoretical foundation. However, the quality of the deep learning functional approximation determines the level of benefit of the control variate. The methods are empirically shown to work for high-dimensional problems. We provide diagnostics that shed light on appropriate network architectures.
    Date: 2018–10
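    For readers unfamiliar with control variates, the following sketch shows the classical textbook version on a toy problem. It is not the martingale or deep-learning construction of the paper: it estimates E[exp(U)] for U uniform on (0, 1), using U itself (known mean 0.5) as the control. The function name and the toy target are assumptions chosen for illustration.

```python
import math
import random
import statistics


def control_variate_demo(n=10_000, seed=0):
    """Compare plain Monte Carlo with a control-variate estimate of E[exp(U)]."""
    rng = random.Random(seed)
    us = [rng.random() for _ in range(n)]
    ys = [math.exp(u) for u in us]

    mean_y = statistics.fmean(ys)
    mean_u = statistics.fmean(us)

    # Variance-minimising coefficient beta = Cov(Y, U) / Var(U), estimated
    # from the same sample (this introduces a small bias, ignored here).
    cov = sum((y - mean_y) * (u - mean_u) for y, u in zip(ys, us)) / (n - 1)
    beta = cov / statistics.variance(us)

    # Subtract the centred control; its known mean is 0.5.
    adjusted = [y - beta * (u - 0.5) for y, u in zip(ys, us)]

    return {
        "plain": mean_y,
        "control_variate": statistics.fmean(adjusted),
        "var_plain": statistics.variance(ys),
        "var_control_variate": statistics.variance(adjusted),
    }
```

    The exact value is e - 1. The per-sample variance of the adjusted estimator is much smaller than that of the plain one because exp(U) and U are highly correlated, which mirrors the abstract's point that the quality of the control determines the size of the benefit, not the correctness of the limit.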
  9. By: Chihli Hung (Chung Yuan Christian University)
    Abstract: This research proposes preference vector modeling (PVM) to deal with polarity shifts in order to improve sentiment classification for word of mouth (WOM). WOM has become a main information resource for consumers when making business or buying decisions. A polarity shift happens when the sentiment polarity of a term is different from that of its associated WOM document, which is one of the most difficult issues in the field of opinion mining. Traditional opinion mining approaches depend on predefined sentiment polarities of terms to be accumulated as the WOM's sentiment polarity or to be trained based on machine learning techniques, but ignore the significance of polarity shift due to some specific usage of terms. There are two kinds of approaches used for detection of polarity shifts in the literature: rule-based approaches and machine learning approaches. However, it is hard for a rule-based approach to manually define a complete rule set. The machine learning approach, which is based on the vector space model (VSM), suffers from the curse of dimensionality. Therefore, this research proposes a novel approach to deal with polarity shifts for sentiment analysis because of the weakness of existing research in the literature. Firstly, this research proposes PVM based on an integration of opinionated documents and a star ranking system. The dimensionality of preference vectors equals the number of rating levels in the star ranking system. Thus, the proposed PVM overcomes the curse of dimensionality, as the dimensionality of the star ranking system is much lower than that of the document vector based on VSM. Then, the automatic approach for polarity shift detection is proposed. The document preference vector is represented based on the average vector of term preference vectors. This approach is able to deal with opinionated documents if they are extracted from the same scale of the star ranking systems and the same domain. Finally, the integrated approach of PVM and some classification techniques is used for improvement of sentiment classification for word of mouth.
    Keywords: Polarity Shift; Preference Vector Modeling; Opinionated Text; Sentiment Analysis; Opinion Mining
    JEL: D80 L86
    Date: 2018–07
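    One possible reading of the preference vector idea is sketched below. The abstract does not give the exact construction, so this is an assumption: each term's preference vector is taken to be the share of its occurrences across star ratings, and a document's vector is the average of its terms' vectors, as the abstract states. Function names and the whitespace tokenisation are hypothetical.

```python
from collections import defaultdict


def term_preference_vectors(reviews, n_stars=5):
    """Build one n_stars-dimensional vector per term.

    reviews -- iterable of (text, stars) pairs with stars in 1..n_stars
    Each component is the share of the term's documents with that rating.
    """
    counts = defaultdict(lambda: [0] * n_stars)
    for text, stars in reviews:
        for term in set(text.lower().split()):
            counts[term][stars - 1] += 1
    return {
        term: [c / sum(vec) for c in vec]
        for term, vec in counts.items()
    }


def document_preference_vector(text, term_vectors, n_stars=5):
    """Average the preference vectors of the known terms in a document."""
    vecs = [
        term_vectors[t] for t in set(text.lower().split()) if t in term_vectors
    ]
    if not vecs:
        # No known terms: fall back to a uniform vector (assumption).
        return [1.0 / n_stars] * n_stars
    return [sum(col) / len(vecs) for col in zip(*vecs)]
```

    Whatever the vocabulary size, every vector has only `n_stars` components, which is the dimensionality argument made in the abstract. A term whose vector peaks at a rating far from the document's own rating would be a candidate polarity shift under this reading.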
  10. By: Avraam Tsantekidis; Nikolaos Passalis; Anastasios Tefas; Juho Kanniainen; Moncef Gabbouj; Alexandros Iosifidis
    Abstract: The recent surge in Deep Learning (DL) research of the past decade has successfully provided solutions to many difficult problems. The field of quantitative analysis has been slowly adapting the new methods to its problems, but due to problems such as the non-stationary nature of financial data, significant challenges must be overcome before DL is fully utilized. In this work a new method to construct stationary features, that allows DL models to be applied effectively, is proposed. These features are thoroughly tested on the task of predicting mid price movements of the Limit Order Book. Several DL models are evaluated, such as recurrent Long Short Term Memory (LSTM) networks and Convolutional Neural Networks (CNN). Finally a novel model that combines the ability of CNNs to extract useful features and the ability of LSTMs to analyze time series, is proposed and evaluated. The combined model is able to outperform the individual LSTM and CNN models in the prediction horizons that are tested.
    Date: 2018–10
  11. By: Shenhao Wang; Jinhua Zhao
    Abstract: Deep neural networks (DNN) have been increasingly applied to travel demand prediction. However, no study has examined how DNN relates to utility-based discrete choice models (DCM) beyond simple comparison of prediction accuracy. To fill this gap, this paper investigates the relationship between DNN and DCM from a theoretical perspective with three major findings. First, we introduce the utility interpretation to the DNN models and demonstrate that DCM is one special case of DNN with shallow and sparse architecture, identifiable parameters, logistic loss, zero regularization, and domain-knowledge based feature transformation. Second, a sequence of four neural network models illustrates how DNN gradually trades away interpretability for predictability in the context of travel mode choice. High predictability is achieved by DNN's powerful representation learning and high model capacity; but interpretability is sacrificed through the loss of convex optimization and statistical properties, and non-identification of parameters. Third, the utility interpretation allows us to develop a numerical method of extracting important economic information from DNN including choice probability, elasticity, marginal rate of substitution, and consumer surplus. Overall, this study makes three contributions: theoretically it frames DCM as a special case of DNN and introduces the utility interpretation to DNN; methodologically it demonstrates the interpretability-predictability tradeoff between DCM and DNN and suggests the potential of their joint improvement; and practically it introduces a post-hoc numerical method to extract economic information from DNN and make it interpretable through the utility concept.
    Date: 2018–10
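    The claim that a discrete choice model is a special case of a neural network can be made concrete with the standard multinomial logit: linear utilities followed by a softmax are exactly a one-layer network with shared weights across alternatives. The sketch below assumes this textbook form; the attribute names (time, cost) and the coefficient values are invented for illustration and do not come from the paper.

```python
import math


def linear_utilities(beta, alternatives):
    """Utility of each alternative as a weighted sum of its attributes."""
    return [sum(b * x for b, x in zip(beta, attrs)) for attrs in alternatives]


def choice_probabilities(utilities):
    """Multinomial logit probabilities, i.e. a softmax over utilities."""
    m = max(utilities)  # subtract the maximum for numerical stability
    exps = [math.exp(u - m) for u in utilities]
    total = sum(exps)
    return [e / total for e in exps]


def marginal_rate_of_substitution(beta, i, j):
    """Ratio of two utility coefficients, e.g. the value of travel time."""
    return beta[i] / beta[j]


# Hypothetical mode choice: attributes are (travel time in minutes, cost).
beta = [-0.05, -0.2]
modes = [(30.0, 2.0), (20.0, 5.0), (45.0, 1.0)]  # e.g. bus, car, bike
probs = choice_probabilities(linear_utilities(beta, modes))
```

    With a linear utility the marginal rate of substitution is a fixed ratio of coefficients; for a deeper network it would have to be computed numerically from derivatives of the learned utility, which is the kind of post-hoc extraction the abstract describes.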
  12. By: Ehsan Hoseinzade; Saman Haratizadeh
    Abstract: Feature extraction from financial data is one of the most important problems in the market prediction domain, for which many approaches have been suggested. Among other modern tools, convolutional neural networks (CNN) have recently been applied for automatic feature selection and market prediction. However, in experiments reported so far, less attention has been paid to the correlation among different markets as a possible source of information for extracting features. In this paper, we suggest a CNN-based framework with specially designed CNNs that can be applied to a collection of data from a variety of sources, including different markets, in order to extract features for predicting the future of those markets. The suggested framework has been applied for predicting the next day's direction of movement for the indices of S&P 500, NASDAQ, DJI, NYSE, and RUSSELL markets based on various sets of initial features. The evaluations show a significant improvement in prediction performance compared to the state-of-the-art baseline algorithms.
    Date: 2018–10
  13. By: Youngjoon Lee (Yonsei University); Soohyon Kim (Bank of Korea); Ki Young Park (Yonsei University)
    Abstract: We quantify the Monetary Policy Committee (MPC) minutes of the Bank of Korea (BOK) using a text mining approach. We propose a novel approach using a field-specific Korean dictionary and contiguous sequences of words (n-grams) to better capture the subtlety of central bank communication. We find that our lexicon-based indicators help explain the current and future BOK monetary policy decisions when considering an augmented Taylor rule, suggesting that they contain additional information beyond the currently available macroeconomic variables. Our indicators remarkably outperform English-based textual classifications, a media-based measure of economic policy uncertainty, and a data-based measure of macroeconomic uncertainty. Our empirical results also emphasize the importance of using a field-specific dictionary and the original Korean text.
    Keywords: monetary policy; text mining; central banking; Bank of Korea; Taylor rule
    JEL: E43 E52 E58
    Date: 2018–10
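    A lexicon-based indicator built from n-grams can be sketched as follows. This is only an illustration of the general technique: the paper works with a field-specific Korean dictionary, whereas the tiny English lexicon, the whitespace tokeniser, and the net-tone formula (hawkish minus dovish, divided by their sum) used here are assumptions.

```python
def ngrams(tokens, n):
    """Return all contiguous n-word sequences as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tone(text, hawkish, dovish, max_n=2):
    """Net tone in [-1, 1]: +1 is fully hawkish, -1 fully dovish.

    hawkish, dovish -- sets of lexicon entries (unigrams up to max_n-grams)
    """
    tokens = text.lower().split()
    grams = [g for n in range(1, max_n + 1) for g in ngrams(tokens, n)]
    h = sum(g in hawkish for g in grams)
    d = sum(g in dovish for g in grams)
    if h + d == 0:
        return 0.0  # no lexicon hits: treat the text as neutral
    return (h - d) / (h + d)


# Hypothetical lexicon entries, invented for this example.
HAWKISH = {"inflation pressure", "tightening"}
DOVISH = {"weak growth", "easing"}
```

    For the sentence "inflation pressure calls for tightening despite weak growth", the sketch counts two hawkish hits and one dovish hit. Matching on bigrams lets "weak growth" count as dovish even though "growth" alone carries no clear tone, which is the motivation for n-grams given in the abstract.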
  14. By: Kfir Eliaz; Ran Spiegler
    Abstract: A "statistician" takes an action on behalf of an agent, based on the agent's self-reported personal data and a sample involving other people. The action that he takes is an estimated function of the agent's report. The estimation procedure involves model selection. We ask the following question: Is truth-telling optimal for the agent given the statistician's procedure? We analyze this question in the context of a simple example that highlights the role of model selection. We suggest that our simple exercise may have implications for the broader issue of human interaction with "machine learning" algorithms.
    Date: 2018–10

This nep-big issue is ©2018 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at For comments please write to the director of NEP, Marco Novarese at <>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.