nep-big New Economics Papers
on Big Data
Issue of 2018‒04‒09
eleven papers chosen by
Tom Coupé
University of Canterbury

  1. Big Data and Regional Science: Opportunities, Challenges, and Directions for Future Research By Schintler, Laurie A.; Fischer, Manfred M.
  2. Machine Learning Indices, Political Institutions, and Economic Development By Klaus Gründler; Tommy Krieger
  3. Evaluating Conditional Cash Transfer Policies with Machine Learning Methods By Tzai-Shuen Chen
  4. Credit Risk Analysis using Machine and Deep learning models By Peter Martey Addo; Dominique Guegan; Bertrand Hassani
  5. Machine Learning with Screens for Detecting Bid-Rigging Cartels By Huber, Martin; Imhof, David
  6. Climate Adaptive Response Estimation: Short And Long Run Impacts Of Climate Change On Residential Electricity and Natural Gas Consumption Using Big Data By Maximilian Auffhammer
  7. Universal features of price formation in financial markets: perspectives from Deep Learning By Justin Sirignano; Rama Cont
  8. In-hospital Mortality Prediction for Trauma Patients Using Cost-sensitive MedLDA By Haruya Ishizuka; Tsukasa Ishigaki; Naoya Kobayashi; Daisuke Kudo; Atsuhiro Nakagawa
  9. Man vs Robots? Future Challenges and Opportunities within Artificial Intelligence Health Care Education Model By Shanel Lu; Sharon L. Burton
  10. Adversarial Generalized Method of Moments By Greg Lewis; Vasilis Syrgkanis
  11. Categorizing Variants of Goodhart's Law By David Manheim; Scott Garrabrant

  1. By: Schintler, Laurie A.; Fischer, Manfred M.
    Abstract: Recent technological, social, and economic trends and transformations are contributing to the production of what is usually referred to as Big Data. Big Data, which is typically defined by four dimensions -- Volume, Velocity, Veracity, and Variety -- changes the methods and tactics for using, analyzing, and interpreting data, requiring new approaches for data provenance, data processing, data analysis and modeling, and knowledge representation. The use and analysis of Big Data involves several distinct stages, from "data acquisition and recording" through "information extraction" and "data integration" to "data modeling and analysis" and "interpretation", each of which introduces challenges that need to be addressed. There are also cross-cutting challenges that underlie many, sometimes all, of the stages of the data analysis pipeline. These relate to "heterogeneity", "uncertainty", "scale", "timeliness", "privacy" and "human interaction". Using the Big Data analysis pipeline as a guiding framework, this paper examines the challenges arising in the use of Big Data in regional science. The paper concludes with some suggestions for future activities to realize the possibilities and potential for Big Data in regional science.
    Keywords: Spatial Big Data, data analysis pipeline, methodological and technical challenges, cross-cutting challenges, regional science
    Date: 2018
  2. By: Klaus Gründler; Tommy Krieger
    Abstract: We present a new aggregation method - called SVM algorithm - and use this technique to produce novel measures of democracy (186 countries, 1960-2014). The method takes its name from a machine learning technique for pattern recognition and has three notable features: it makes functional assumptions unnecessary, it accounts for measurement uncertainty, and it creates continuous and dichotomous indices. We use the SVM indices to investigate the effect of democratic institutions on economic development, and find that democracies grow faster than autocracies. Furthermore, we illustrate how the estimation results are affected by conceptual and methodological changes in the measure of democracy. In particular, we show that instrumental variables cannot compensate for measurement errors produced by conventional aggregation methods, and explain why this failure leads to an overestimation of regression coefficients.
    Keywords: democracy, development, economic growth, estimation bias, indices, institutions, machine learning, support vector machines
    JEL: C26 C43 N40 O10 P16 P48
    Date: 2018
  3. By: Tzai-Shuen Chen
    Abstract: This paper presents an out-of-sample prediction comparison between major machine learning models and a structural econometric model. Over the past decade, machine learning has established itself as a powerful tool in many prediction applications, but the approach is still not widely adopted in empirical economic studies. To evaluate its benefits, I use the most common machine learning algorithms -- CART, C4.5, LASSO, random forest, and AdaBoost -- to construct prediction models for a cash transfer experiment conducted by the Progresa program in Mexico, and I compare the prediction results with those of a previous structural econometric study. Two prediction tasks are performed: an out-of-sample forecast and a long-term within-sample simulation. For the out-of-sample forecast, both the mean absolute error and the root mean square error of the school attendance rates found by all machine learning models are smaller than those found by the structural model. Random forest and AdaBoost have the highest accuracy for the individual outcomes of all subgroups. For the long-term within-sample simulation, the structural model outperforms all of the machine learning models. The poor within-sample fit of the machine learning models results from the inaccuracy of the income and pregnancy prediction models. The results show that the machine learning models perform better than the structural model when there are ample data to learn from; when the data are limited, however, the structural model offers a more sensible prediction. The findings of this paper show promise for adopting machine learning in economic policy analyses in the era of big data.
    Date: 2018–03
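The two forecast-accuracy metrics compared in the abstract above, mean absolute error (MAE) and root mean square error (RMSE), can be sketched in a few lines; the attendance rates below are hypothetical and not from the paper:

```python
# Minimal sketch of the two out-of-sample error metrics the paper compares.
import math

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean square error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Toy school attendance rates (hypothetical, not the paper's data).
actual    = [0.90, 0.85, 0.95, 0.80]
predicted = [0.88, 0.86, 0.93, 0.83]
print(round(mae(actual, predicted), 4))   # 0.02
print(round(rmse(actual, predicted), 4))  # 0.0212
```

A smaller value on both metrics, as the abstract reports for the machine learning models in the out-of-sample task, indicates more accurate forecasts.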
  4. By: Peter Martey Addo (Expert Synapses SNCF Mobilité; LabEx ReFi); Dominique Guegan (University Paris 1 Pantheon Sorbonne; Ca' Foscari Unversity Venice; IPAG Business School; LabEx ReFi); Bertrand Hassani (Capgemini Consulting; LabEx ReFi)
    Abstract: Owing to the advanced technology associated with Big Data, data availability, and computing power, most banks and lending financial institutions are renewing their business models. Credit risk prediction, monitoring, model reliability, and effective loan processing are key to decision making and transparency. In this work, we build binary classifiers based on machine and deep learning models on real data to predict loan default probability. The top 10 important features from these models are selected and then used in the modelling process to test the stability of the binary classifiers by comparing performance on separate data. We observe that tree-based models are more stable than models based on multilayer artificial neural networks. This raises several questions about the intensive use of deep learning systems in enterprises.
    Keywords: Credit risk, Financial regulation, Data Science, Big Data, Deep learning
    Date: 2018
  5. By: Huber, Martin; Imhof, David
    Abstract: We combine machine learning techniques with statistical screens computed from the distribution of bids in tenders within the Swiss construction sector to predict collusion through bid-rigging cartels. We assess the out-of-sample performance of this approach and find that it correctly classifies more than 80% of bidding processes as collusive or non-collusive. Since the correct classification rate differs between truly non-collusive and collusive processes, we also investigate trade-offs in reducing false positive vs. false negative predictions. Finally, we discuss the policy implications of our method for competition agencies aiming to detect bid-rigging cartels.
    Keywords: Bid rigging detection; screening methods; variance screen; cover bidding screen; structural and behavioural screens; machine learning; lasso; ensemble methods
    JEL: C21 C45 C52 D22 D40 K40
    Date: 2018–03–29
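As an illustration of one of the statistical screens named in the keywords above, here is a minimal sketch of a variance screen, which flags tenders whose bids cluster unusually tightly; the threshold and bid values are hypothetical, not taken from the paper:

```python
# Sketch of a variance screen: the coefficient of variation (CV) of bids
# within a tender. Unusually low dispersion can be a red flag for cover
# bidding; the 0.05 threshold here is purely illustrative.
import statistics

def coefficient_of_variation(bids):
    """Population standard deviation of bids divided by their mean."""
    return statistics.pstdev(bids) / statistics.fmean(bids)

def variance_screen(bids, threshold=0.05):
    """Flag a tender as suspicious when bid dispersion is very low."""
    return coefficient_of_variation(bids) < threshold

competitive = [100.0, 112.0, 95.0, 120.0]   # widely spread bids
suspicious  = [100.0, 101.0, 102.0, 100.5]  # tightly clustered bids
print(variance_screen(competitive))  # False
print(variance_screen(suspicious))   # True
```

In the paper's setup, screens like this one become input features for the machine learning classifiers rather than standalone decision rules.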
  6. By: Maximilian Auffhammer
    Abstract: This paper proposes a simple two-step estimation method (Climate Adaptive Response Estimation - CARE) to estimate sectoral climate damage functions, which account for long-run adaptation. The paper applies this method in the context of residential electricity and natural gas demand for the world's sixth largest economy - California. The advantage of the proposed method is that it only requires detailed information on intensive margin behavior, yet does not require explicit knowledge of the extensive margin response (e.g., technology adoption). Using almost two billion energy bills, we estimate spatially highly disaggregated intensive margin temperature response functions using daily variation in weather. In a second step, we explain variation in the slopes of the dose-response functions across space as a function of summer climate. Using 18 state-of-the-art climate models, we simulate future demand by letting households vary consumption along the intensive and extensive margins. We show that failing to account for extensive margin adjustment in electricity demand leads to a significant underestimate of the future impacts on electricity consumption. We further show that reductions in natural gas demand more than offset any climate-driven increases in electricity consumption in this context.
    JEL: Q4 Q54
    Date: 2018–03
  7. By: Justin Sirignano; Rama Cont
    Abstract: Using a large-scale Deep Learning approach applied to a high-frequency database containing billions of electronic market quotes and transactions for US equities, we uncover nonparametric evidence for the existence of a universal and stationary price formation mechanism relating the dynamics of supply and demand for a stock, as revealed through the order book, to subsequent variations in its market price. We assess the model by testing its out-of-sample predictions for the direction of price moves given the history of price and order flow, across a wide range of stocks and time periods. The universal price formation model exhibits remarkably stable out-of-sample prediction accuracy across time, for a wide range of stocks from different sectors. Interestingly, these results also hold for stocks which are not part of the training sample, showing that the relations captured by the model are universal and not asset-specific. The universal model, trained on data from all stocks, outperforms, in terms of out-of-sample prediction accuracy, asset-specific linear and nonlinear models trained on time series of any given stock, showing that the universal nature of price formation weighs in favour of pooling together financial data from various stocks rather than designing asset- or sector-specific models as commonly done. Standard data normalizations based on volatility, price level or average spread, or partitioning the training data into sectors or categories such as large/small tick stocks, do not improve training results. On the other hand, inclusion of price and order flow history over many past observations is shown to improve forecasting performance, showing evidence of path-dependence in price dynamics.
    Date: 2018–03
  8. By: Haruya Ishizuka; Tsukasa Ishigaki; Naoya Kobayashi; Daisuke Kudo; Atsuhiro Nakagawa
    Abstract: In intensive care units (ICUs), mortality prediction using patients' vital signs or demographics yields helpful information to support the decision-making of intensivists. Clinical texts recorded by medical staff also tend to be valuable for prediction. However, text data cannot be used directly for outcome prediction in a regression framework. In addition, learning prediction models for such outcomes is an imbalanced-data problem, because survivors outnumber dead patients in most ICUs. To address these difficulties, we present Cost-Sensitive MedLDA: a supervised topic model employing cost-sensitive learning. The model builds a predictor from heterogeneous data such as vital signs, demographic information, and clinical text under class imbalance. Through experimentation and discussion, we demonstrate that the model has two benefits for use in medical fields: 1) it achieves high prediction performance for minority instances while maintaining good performance for majority instances, even when the training set is imbalanced; 2) it can reveal, from the clinical texts, characteristics that are associated with bad outcomes.
    Date: 2018–03
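The cost-sensitive idea the abstract builds on can be illustrated with a toy asymmetric loss, in which a missed death (false negative) is penalized more heavily than a false alarm; the cost values and labels below are hypothetical and do not reproduce the paper's MedLDA model:

```python
# Illustrative sketch of cost-sensitive evaluation for imbalanced
# mortality prediction: errors on the rare positive class (death) cost
# more than errors on the majority class (survival).
def weighted_loss(labels, predictions, cost_fn=5.0, cost_fp=1.0):
    """Average 0/1 loss with asymmetric misclassification costs."""
    total = 0.0
    for y, p in zip(labels, predictions):
        if y == 1 and p == 0:
            total += cost_fn   # missed death: expensive false negative
        elif y == 0 and p == 1:
            total += cost_fp   # false alarm: cheaper false positive
    return total / len(labels)

labels      = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]  # 2 deaths out of 10 patients
all_survive = [0] * 10                        # trivial majority-class predictor
print(weighted_loss(labels, all_survive))     # 1.0 (= 2 * 5.0 / 10)
```

Under a symmetric loss, the trivial "everyone survives" predictor looks strong on imbalanced data; asymmetric costs expose its failure on the minority class, which is the behavior the paper's cost-sensitive learner is designed to avoid.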
  9. By: Shanel Lu (EmergenceAI); Sharon L. Burton (SLBurtonConsulting)
    Abstract: This study investigated the need for a formal artificial intelligence (AI) health care education model for 21st-century AI health care learners. Health care has continuously transformed at the administrative, operational, and practical levels. This rapidly changing industry requires a synthesis of multifaceted and diverse forms of thinking. AI health care professionals within business, technology, art, biomedical, and other health-care-related sectors must work cross-functionally to establish roles that will meet the need to improve health care at all levels. To achieve this, we researched and investigated how to create an AI health care education model that fosters collaboration and innovation. There has been a significant call for collaboration among academicians, clinical scientists, and health care practitioners of all levels to identify a comprehensive AI health care education model, owing to the current void in health care course design. To further this empirical study, the researchers conducted a qualitative study, comprising interviews and surveys, inviting participants from AI health care, business, biomedicine, clinical science, academia, and capital investment to expound on the significance of each professional sector for AI health care education.
    Keywords: Health Care technology solutions, Health care Technology solutions education, Incubator Clinical Hours, Emergence AI Curriculum Design Mode
    Date: 2017
  10. By: Greg Lewis; Vasilis Syrgkanis
    Abstract: We provide an approach for learning deep neural net representations of models described via conditional moment restrictions. Conditional moment restrictions are widely used, as they are the language by which social scientists describe the assumptions they make to enable causal inference. We formulate the problem of estimating the underlying model as a zero-sum game between a modeler and an adversary and apply adversarial training. Our approach is similar in nature to Generative Adversarial Networks (GANs), though here the modeler is learning a representation of a function that satisfies a continuum of moment conditions and the adversary is identifying violated moments. We outline ways of constructing effective adversaries in practice, including kernels centered by k-means clustering and random forests. We examine the practical performance of our approach in the setting of non-parametric instrumental variable regression.
    Date: 2018–03
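The core idea, that a conditional moment restriction E[y - f(x) | z] = 0 implies E[(y - f(x)) * g(z)] = 0 for every test function g, with the adversary searching for a g that exposes a violation, can be sketched on simulated data; the data-generating process and test functions below are illustrative, not the paper's construction:

```python
# Sketch: a correctly specified f satisfies the moment conditions, while a
# misspecified f is exposed by a well-chosen test function g(z).
import random

random.seed(0)
n = 10_000
z = [random.gauss(0, 1) for _ in range(n)]        # instrument
x = [zi + random.gauss(0, 0.1) for zi in z]       # instrument-driven regressor
y = [2.0 * xi + random.gauss(0, 1) for xi in x]   # true model: f(x) = 2x

def violation(f, g):
    """Sample analogue of E[(y - f(x)) * g(z)]."""
    return sum((yi - f(xi)) * g(zi) for xi, yi, zi in zip(x, y, z)) / n

def true_f(v): return 2.0 * v    # correctly specified
def wrong_f(v): return 0.5 * v   # misspecified

# The constant test function g(z) = 1 barely separates the two models here,
# but g(z) = z does: the adversary's job is to find such a discriminating g.
for g in (lambda v: 1.0, lambda v: v):
    print(round(violation(true_f, g), 2), round(violation(wrong_f, g), 2))
```

In the paper the adversary is trained rather than hand-picked, but the objective is the same: maximize the empirical moment violation over a class of test functions.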
  11. By: David Manheim; Scott Garrabrant
    Abstract: There are several distinct failure modes for the overoptimization of systems on the basis of metrics. This occurs when a metric that can be used to improve a system is used to an extent that further optimization is ineffective or harmful, and is sometimes termed Goodhart's Law. This class of failures is often poorly understood, partly because the terminology for discussing them is ambiguous, and partly because discussion using this ambiguous terminology ignores the distinctions between different failure modes of this general type. This paper expands on an earlier discussion by Garrabrant, which notes that there are "(at least) four different mechanisms" that relate to Goodhart's Law. It explores these mechanisms further and specifies more clearly how they occur. This discussion should help in better understanding these types of failures in economic regulation, in public policy, in machine learning, and in Artificial Intelligence alignment. The importance of Goodhart effects depends on the amount of power directed towards optimizing the proxy, so the increased optimization power offered by artificial intelligence makes it especially critical for that field.
    Date: 2018–03

This nep-big issue is ©2018 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at . For comments, please write to the director of NEP, Marco Novarese at <>. Put “NEP” in the subject; otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.