nep-big New Economics Papers
on Big Data
Issue of 2018‒01‒08
eight papers chosen by
Tom Coupé
University of Canterbury

  1. Dynamic competition in deceptive markets By JOHNEN, Johannes
  2. Generic Machine Learning Inference on Heterogenous Treatment Effects in Randomized Experiments By Victor Chernozhukov; Mert Demirer; Esther Duflo; Ivan Fernandez-Val
  3. Hedonic Recommendations: An Econometric Application on Big Data By Okay Gunes
  4. Targeting policy-compliers with machine learning: an application to a tax rebate programme in Italy By Monica Andini; Emanuele Ciani; Guido de Blasio; Alessio D'Ignazio; Viola Salvestrini
  5. Hospital Readmission is Highly Predictable from Deep Learning By Damien Échevin; Qing Li; Marc-André Morin
  6. Learning Objectives for Treatment Effect Estimation By Xinkun Nie; Stefan Wager
  7. Machine Learning for Partial Identification: Example of Bracketed Data By Vira Semenova
  8. Dengue Spread Modeling in the Absence of Sufficient Epidemiological Parameters: Comparison of SARIMA and SVM Time Series Models By Jerelyn Co; Jason Allan Tan; Ma. Regina Justina Estuar; Kennedy Espina

  1. By: JOHNEN, Johannes (CORE, Université catholique de Louvain)
    Abstract: In many deceptive markets, firms design contracts to exploit mistakes of naive consumers. These contracts also attract less profitable sophisticated consumers. I study such markets when firms compete repeatedly and gather usage data about their customers which is informative about the likelihood of a customer being sophisticated. I show in a benchmark model that firms do not benefit from private information in this setting when all consumers are rational. I find that, in sharp contrast to a model with only rational consumers, this customer information mitigates competition and is of great value to its owner despite intense competition. I discuss several implications of the value of customer information on naiveté. Private information on customers’ sophistication induces profits that are bell-shaped in the share of naive consumers. Firms prefer an even mix of both customer types. I also show that if firms can educate (some) naives about hidden fees, competition is already mitigated when firms compete for customers with initially symmetric information. I analyze a policy that discloses customer information to all firms and thereby increases consumer surplus. I discuss how the UK government’s midata program might induce crucial aspects of this policy, and illustrate the robustness of the results through several extensions.
    Keywords: Consumer mistakes, deceptive products, shrouded attributes, big data, targeted pricing, consumer data, add-on pricing, price discrimination, industry dynamics
    JEL: C22 C58
    Date: 2017–12–20
  2. By: Victor Chernozhukov; Mert Demirer; Esther Duflo; Ivan Fernandez-Val
    Abstract: We propose strategies to estimate and make inference on key features of heterogeneous effects in randomized experiments. These key features include best linear predictors of the effects using machine learning proxies, average effects sorted by impact groups, and average characteristics of most and least impacted units. The approach is valid in high dimensional settings, where the effects are proxied by machine learning methods. We post-process these proxies into estimates of the key features. Our approach is agnostic about the properties of the machine learning estimators used to produce proxies, and it avoids making any strong assumptions about them. Estimation and inference rely on repeated data splitting to avoid overfitting and achieve validity. Our variational inference method is shown to be uniformly valid and quantifies the uncertainty coming from both parameter estimation and data splitting. In essence, we take medians of p-values and medians of confidence intervals, resulting from many different data splits, and then adjust their nominal level to guarantee uniform validity. The inference method could be of substantial independent interest in many machine learning applications. Empirical applications illustrate the use of the approach.
    Date: 2017–12
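    The median-aggregation step described in the abstract can be sketched in a few lines. This is a hypothetical helper, not the authors' code: it takes the median p-value across splits and doubles it, which is one simple way to adjust the nominal level so the aggregated test stays conservative under split randomness.

    ```python
    import statistics

    def split_robust_pvalue(pvalues_per_split):
        """Aggregate p-values from repeated data splits (illustrative sketch).

        Take the median p-value across the splits, then double it (capped
        at 1) so that the resulting test remains conservatively valid at
        the nominal level despite the randomness of the splits.
        """
        return min(1.0, 2.0 * statistics.median(pvalues_per_split))

    # Toy usage: p-values from 11 hypothetical data splits.
    pvals = [0.012, 0.020, 0.015, 0.034, 0.008, 0.019,
             0.025, 0.011, 0.017, 0.022, 0.014]
    print(split_robust_pvalue(pvals))  # median 0.017, doubled -> 0.034
    ```

    The same median-then-adjust idea applies to confidence intervals: take coordinate-wise medians of the interval endpoints across splits at a widened nominal level.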
  3. By: Okay Gunes (Centre d'Economie de la Sorbonne)
    Abstract: This work demonstrates how economic theory can be applied to big data analysis. To do this, I propose two layers of machine learning that use econometric models introduced into a recommender system. The reason for doing so is to challenge traditional recommendation approaches. These approaches are inherently biased because they ignore the final preference order for each individual and under-specify the interaction between the socio-economic characteristics of the participants and the characteristics of the commodities in question. In this respect, our hedonic recommendation approach proposes to first correct the internal preferences with respect to the tastes of each individual under the characteristics of given products. In the second layer, the relative preferences across participants are predicted by socio-economic characteristics. The robustness of the model is tested with the MovieLens 100k data set (943 users rating 1,682 movies) maintained by GroupLens. Our methodology shows the importance and the necessity of correcting the data set by using economic theory. This methodology can be applied to all recommender systems that use ratings based on consumer decisions.
    Keywords: Big Data; Python; R; Machine learning; Recommendation Engine; Econometrics
    JEL: C01 C80
    Date: 2017–12
  4. By: Monica Andini (Bank of Italy); Emanuele Ciani (Bank of Italy); Guido de Blasio (Bank of Italy); Alessio D'Ignazio (Bank of Italy); Viola Salvestrini (London School of Economics and Political Science)
    Abstract: Machine Learning (ML) can be a powerful tool to inform policy decisions. Those who are treated under a programme might have different propensities to put into practice the behaviour that the policymaker wants to incentivize. ML algorithms can be used to predict the policy-compliers; that is, those who are most likely to behave in the way desired by the policymaker. When the design of the programme is tailored to target the policy-compliers, the overall effectiveness of the policy is increased. This paper proposes an application of ML targeting that uses the massive tax rebate scheme introduced in Italy in 2014.
    Keywords: machine learning, prediction, programme evaluation, fiscal stimulus
    JEL: C5 H3
    Date: 2017–12
  5. By: Damien Échevin; Qing Li; Marc-André Morin
    Abstract: Hospital readmission is costly, and existing models are often poor or moderate in predicting readmission. We sought to develop and test a method that can be applied generally by hospitals. Such a tool can help clinicians identify patients who are more likely to be readmitted, either at early stages of hospital stay or at hospital discharge. Relying on state-of-the-art machine learning algorithms, we predict the probability of 30-day readmission at hospital admission and at hospital discharge using administrative data on 1,633,099 hospital stays from Quebec between 1995 and 2012. We measure the performance of the predictions with the area under the receiver operating characteristic curve (AUC). Deep Learning produced excellent prediction of readmission province-wide, and Random Forest reached a very similar level. The AUC for these two algorithms reached above 78% at hospital admission and above 87% at hospital discharge, and the diagnostic codes are among the most predictive variables. The ease of implementation of machine learning algorithms, together with objectively validated reliability, brings new possibilities for cost reduction in the health care system.
    Keywords: Machine learning; Logistic regression; Risk of re-hospitalisation; Healthcare costs
    JEL: I10 C52
    Date: 2017
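    The AUC reported above is equivalent to the Mann-Whitney statistic: the probability that a randomly chosen positive case is scored higher than a randomly chosen negative case. A minimal stand-alone illustration (toy labels and scores, not the study's data):

    ```python
    def auc(labels, scores):
        """Area under the ROC curve via the Mann-Whitney formulation:
        the fraction of positive/negative pairs in which the positive
        case receives the higher score (ties count half)."""
        pos = [s for y, s in zip(labels, scores) if y == 1]
        neg = [s for y, s in zip(labels, scores) if y == 0]
        wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
        return wins / (len(pos) * len(neg))

    # Toy example: 3 readmitted (label 1) and 3 non-readmitted patients,
    # scored by a hypothetical readmission-risk model.
    labels = [1, 1, 1, 0, 0, 0]
    scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
    print(round(auc(labels, scores), 4))  # -> 0.8889
    ```

    An AUC of 0.5 corresponds to random scoring; the 0.78-0.87 range cited in the abstract indicates strong discrimination.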
  6. By: Xinkun Nie; Stefan Wager
    Abstract: We develop a general class of two-step algorithms for heterogeneous treatment effect estimation in observational studies. We first estimate marginal effects and treatment propensities to form an objective function that isolates the heterogeneous treatment effects, and then optimize the learned objective. This approach has several advantages over existing methods. From a practical perspective, our method is very flexible and easy to use: In both steps, we can use any method of our choice, e.g., penalized regression, a deep net, or boosting; moreover, these methods can be fine-tuned by cross-validating on the learned objective. Meanwhile, in the case of penalized kernel regression, we show that our method has a quasi-oracle property, whereby even if our pilot estimates for marginal effects and treatment propensities are not particularly accurate, we achieve the same regret bounds as an oracle who has a-priori knowledge of these nuisance components. We implement variants of our method based on both penalized regression and convolutional neural networks, and find promising performance relative to existing baselines.
    Date: 2017–12
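    For the simplest possible function class, a constant treatment effect tau, the second-step optimization described in the abstract has a closed form. The sketch below is illustrative only: the pilot estimates of the outcome regression and the propensity score are supplied directly rather than learned, and the helper name is hypothetical.

    ```python
    def r_learner_constant(y, w, m_hat, e_hat):
        """Second step of the two-step scheme for a constant effect tau:
        minimize sum_i ((y_i - m_hat_i) - (w_i - e_hat_i) * tau)^2,
        where m_hat estimates E[Y|X] and e_hat estimates P(W=1|X)
        (step one; any ML method could produce them). The minimizer is
        a simple ratio, computed below."""
        num = sum((yi - mi) * (wi - ei)
                  for yi, wi, mi, ei in zip(y, w, m_hat, e_hat))
        den = sum((wi - ei) ** 2 for wi, ei in zip(w, e_hat))
        return num / den

    # Noiseless toy data with a true effect of 2.0 and known nuisances.
    y = [3, 1, 5, 3]               # outcomes
    w = [1, 0, 1, 0]               # treatment indicators
    m_hat = [2, 2, 4, 4]           # pilot estimates of E[Y|X]
    e_hat = [0.5, 0.5, 0.5, 0.5]   # pilot propensity estimates
    print(r_learner_constant(y, w, m_hat, e_hat))  # -> 2.0
    ```

    In the paper, tau is drawn from a rich class (penalized regression, deep nets, boosting) rather than a constant, but the learned objective being optimized is the same.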
  7. By: Vira Semenova
    Abstract: Partially identified models occur commonly in economic applications. A common problem in this literature is a regression problem with a bracketed (interval-censored) outcome variable Y, which creates a set-identified parameter of interest. Recent studies have only considered finite-dimensional linear regression in this context. To incorporate more complex controls into the problem, we consider a partially linear projection of Y onto the set of functions that are linear in treatment/policy variables and nonlinear in the controls. We characterize the identified set for the linear component of this projection and propose an estimator of its support function. Our estimator converges at a parametric rate and is asymptotically normal. It may be useful for labor economics applications that involve bracketed salaries and rich, high-dimensional demographic data about the subjects of the study.
    Date: 2017–12
  8. By: Jerelyn Co (Ateneo de Manila University); Jason Allan Tan (Ateneo de Manila University); Ma. Regina Justina Estuar (Ateneo de Manila University); Kennedy Espina (Ateneo de Manila University)
    Abstract: Dengue remains a major public health concern in the Philippines, claiming hundreds of lives every year. Given limited data for deriving the necessary epidemiological parameters for deterministic disease models, forecasting as a means of controlling and anticipating outbreaks remains a challenge. In this study, two time series models, namely Seasonal Autoregressive Integrated Moving Average (SARIMA) and Support Vector Machine (SVM), were developed without requiring prior epidemiological parameters. The models’ performance in predicting dengue incidences in the Western Visayas Region of the Philippines was compared by measuring the Root Mean Square Error (RMSE) and Mean Average Error (MAE). Results showed that both models were effective in forecasting dengue incidences for epidemiological surveillance, as validated by historical data. The SARIMA model yielded average RMSE and MAE scores of 16.8187 and 11.4640, respectively, while the SVM model achieved scores of 11.8723 and 7.7369, respectively. With the data and setup used, this study showed that SVM outperformed SARIMA in forecasting dengue incidences. Furthermore, a preliminary investigation of one-month-lagged climate variables using a Random Forest Regressor’s feature ranking yielded rain intensity and value as the top possible climate predictors of dengue incidence.
    Keywords: SARIMA, SVM, Dengue Fever, Time Series Modeling, Feature importance
    Date: 2017
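    The RMSE and MAE scores used to compare the two models are straightforward to compute. A small illustration with made-up monthly case counts and forecasts (not the study's data; the model labels are stand-ins):

    ```python
    import math

    def rmse(actual, predicted):
        """Root Mean Square Error: penalizes large misses more heavily."""
        return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted))
                         / len(actual))

    def mae(actual, predicted):
        """Mean Absolute Error: average size of the forecast miss."""
        return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

    # Hypothetical monthly dengue case counts and two sets of forecasts.
    actual  = [120, 135, 150, 160, 140, 130]
    model_a = [110, 140, 145, 170, 150, 125]   # stand-in for SARIMA forecasts
    model_b = [118, 133, 152, 158, 143, 129]   # stand-in for SVM forecasts
    for name, pred in [("model A", model_a), ("model B", model_b)]:
        print(name, round(rmse(actual, pred), 4), round(mae(actual, pred), 4))
    # model A 7.9057 7.5
    # model B 2.0817 2.0
    ```

    As in the study's comparison, the model with the smaller RMSE and MAE (here model B) is the better forecaster on this series.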

This nep-big issue is ©2018 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at <>. For comments please write to the director of NEP, Marco Novarese, at <>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.