nep-big New Economics Papers
on Big Data
Issue of 2017‒10‒01
seven papers chosen by
Tom Coupé
University of Canterbury

  1. $L_2$Boosting for Economic Applications By Ye Luo; Martin Spindler
  2. Planning Ahead for Better Neighborhoods: Long Run Evidence from Tanzania By Baruah, Neeraj; Dahlstrand-Rudin, Amanda; Michaels, Guy; Nigmatulina, Dzhamilya; Rauch, Ferdinand; Regan, Tanner
  3. High-Dimensional Metrics in R By Victor Chernozhukov; Chris Hansen; Martin Spindler
  4. Model Averaging and its Use in Economics By Steel, Mark F. J.
  5. Machine Learning Tests for Effects on Multiple Outcomes By Jens Ludwig; Sendhil Mullainathan; Jann Spiess
  6. Optimal Data Collection for Randomized Control Trials By Pedro Carneiro; Sokbae Lee; Daniel Wilhelm
  7. High-Dimensional $L_2$Boosting: Rate of Convergence By Ye Luo; Martin Spindler

  1. By: Ye Luo; Martin Spindler
    Abstract: In recent years, more and more high-dimensional data sets, in which the number of parameters $p$ is large relative to, or even exceeds, the number of observations $n$, have become available to applied researchers. Boosting algorithms represent one of the major advances in machine learning and statistics in recent years and are suitable for the analysis of such data sets. While Lasso has been applied very successfully to high-dimensional data sets in economics, boosting has been underutilized in this field, although it has proven very powerful in fields like biostatistics and pattern recognition. We attribute this to a lack of theoretical results for boosting. The goal of this paper is to fill this gap and show that boosting is a competitive method for inference on a treatment effect and for instrumental variable (IV) estimation in a high-dimensional setting. First, we present the $L_2$Boosting algorithm with componentwise least squares, together with variants tailored for regression problems, which are the workhorse of most econometric applications. Then we show how $L_2$Boosting can be used for the estimation of treatment effects and for IV estimation. We highlight the methods and illustrate them with simulations and empirical examples. For further results and technical details we refer to Luo and Spindler (2016, 2017) and to the online supplement of the paper.
    Date: 2017–02
    URL: http://d.repec.org/n?u=RePEc:arx:papers:1702.03244&r=big
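    The componentwise $L_2$Boosting update described in the abstract is simple enough to sketch. Below is a minimal, purely illustrative Python/numpy version (invented variable names and toy data; not the authors' implementation, and the step size and number of steps are arbitrary):

      import numpy as np

      def l2_boost(X, y, n_steps=200, nu=0.1):
          """Componentwise L2Boosting: at each step, regress the current residual
          on the single column that fits it best and take a small step nu."""
          n, p = X.shape
          beta = np.zeros(p)
          intercept = y.mean()
          resid = y - intercept
          for _ in range(n_steps):
              # univariate least-squares coefficient of the residual on each column
              coefs = X.T @ resid / (X ** 2).sum(axis=0)
              sse = ((resid[:, None] - X * coefs) ** 2).sum(axis=0)
              j = np.argmin(sse)                # best-fitting single covariate
              beta[j] += nu * coefs[j]          # shrunken coefficient update
              resid -= nu * coefs[j] * X[:, j]  # update the residual
          return intercept, beta

      # toy example with p > n: the fit should concentrate on columns 0 and 3
      rng = np.random.default_rng(0)
      n, p = 100, 500
      X = rng.standard_normal((n, p))
      y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.standard_normal(n)
      intercept, beta = l2_boost(X, y)
      print(np.flatnonzero(np.abs(beta) > 0.1))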
  2. By: Baruah, Neeraj; Dahlstrand-Rudin, Amanda; Michaels, Guy; Nigmatulina, Dzhamilya; Rauch, Ferdinand; Regan, Tanner
    Abstract: What are the long-run consequences of planning and providing basic infrastructure in neighborhoods where people build their own homes? We study "Sites and Services" projects implemented in seven Tanzanian cities during the 1970s and 1980s, half of which provided infrastructure in previously unpopulated areas (de novo neighborhoods), while the other half upgraded squatter settlements. Using satellite images and surveys from the 2010s, we find that de novo neighborhoods developed better housing than adjacent residential areas (control areas) that were also initially unpopulated. Specifically, de novo neighborhoods are more orderly and their buildings have larger footprint areas and are more likely to have multiple stories, as well as connections to electricity and water, basic sanitation, and access to roads. And though de novo neighborhoods generally attracted better-educated residents than control areas, the educational difference is too small to account for the large difference in residential quality that we find. While we have no natural counterfactual for the upgrading areas, descriptive evidence suggests that they are, if anything, worse than the control areas.
    Keywords: Africa; economic development; Slums; Urban Economics
    JEL: O18 R14 R31
    Date: 2017–09
    URL: http://d.repec.org/n?u=RePEc:cpr:ceprdp:12319&r=big
  3. By: Victor Chernozhukov; Chris Hansen; Martin Spindler
    Abstract: The R package hdm (High-dimensional Metrics) is an evolving collection of statistical methods for estimation and quantification of uncertainty in high-dimensional approximately sparse models. It focuses on providing confidence intervals and significance testing for (possibly many) low-dimensional subcomponents of the high-dimensional parameter vector. It provides efficient estimators and uniformly valid confidence intervals for regression coefficients on target variables (e.g., a treatment or policy variable) in a high-dimensional approximately sparse regression model, for the average treatment effect (ATE) and the average treatment effect on the treated (ATET), as well as for extensions of these parameters to the endogenous setting. Theory-grounded, data-driven methods for selecting the penalization parameter in Lasso regressions under heteroscedastic and non-Gaussian errors are implemented. Moreover, joint/simultaneous confidence intervals for regression coefficients of a high-dimensional sparse regression are implemented, including a joint significance test for Lasso regression. Data sets which have been used in the literature and which might be useful for classroom demonstration and for testing new estimators are included. R and the package hdm are open-source software projects and can be freely downloaded from CRAN: http://cran.r-project.org.
    Date: 2016–03
    URL: http://d.repec.org/n?u=RePEc:arx:papers:1603.01700&r=big
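    hdm itself is an R package; as a rough Python illustration of the kind of procedure it implements, the sketch below runs post-double-selection for a single target (treatment) coefficient. It uses scikit-learn's cross-validated Lasso as a stand-in for hdm's theory-driven plug-in penalty, and all names and data are invented for this digest:

      import numpy as np
      from sklearn.linear_model import LassoCV, LinearRegression

      def double_selection_effect(y, d, X):
          """Post-double-selection sketch: Lasso of y on X and of d on X select
          controls; the treatment coefficient then comes from OLS of y on d and
          the union of the selected controls."""
          sel_y = np.flatnonzero(LassoCV(cv=5).fit(X, y).coef_)
          sel_d = np.flatnonzero(LassoCV(cv=5).fit(X, d).coef_)
          controls = np.union1d(sel_y, sel_d)
          Z = np.column_stack([d, X[:, controls]]) if controls.size else d.reshape(-1, 1)
          return LinearRegression().fit(Z, y).coef_[0]  # coefficient on d

      # toy example: the true effect of d on y is 1.0, confounded by X[:, 0]
      rng = np.random.default_rng(1)
      n, p = 200, 300
      X = rng.standard_normal((n, p))
      d = X[:, 0] + 0.5 * rng.standard_normal(n)
      y = 1.0 * d + 2.0 * X[:, 0] + rng.standard_normal(n)
      print(double_selection_effect(y, d, X))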
  4. By: Steel, Mark F. J.
    Abstract: The method of model averaging has become an important tool for dealing with model uncertainty, in particular in empirical settings with large numbers of potential models and relatively limited numbers of observations, as are common in economics. Model averaging is a natural response to model uncertainty in a Bayesian framework, so most of the paper deals with Bayesian model averaging. In addition, frequentist model averaging methods are also discussed. Numerical techniques for implementing these methods are explained, and I point the reader to some freely available computational resources. The main focus is on the problem of variable selection in linear regression models, but the paper also discusses other, more challenging, settings. Some of the applied literature is reviewed, with particular emphasis on applications in economics. The role of the prior assumptions in Bayesian procedures is highlighted, and some recommendations for applied users are provided.
    Keywords: Bayesian methods; Model uncertainty; Normal linear model; Prior specification; Robustness
    JEL: C11 C15 C20 C52 O47
    Date: 2017–09–19
    URL: http://d.repec.org/n?u=RePEc:pra:mprapa:81568&r=big
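    As a small worked illustration of model averaging (not taken from the paper), the sketch below performs approximate Bayesian model averaging over all subsets of a handful of regressors, weighting each model by exp(-BIC/2), a standard large-sample approximation to posterior model probabilities under equal prior model weights; data and names are hypothetical:

      import numpy as np
      from itertools import combinations

      def bic_model_average(y, X):
          """Enumerate all subsets of the columns of X, fit OLS for each,
          and average coefficients with weights proportional to exp(-BIC/2)."""
          n, p = X.shape
          bics, betas = [], []
          for k in range(p + 1):
              for subset in combinations(range(p), k):
                  Z = np.column_stack([np.ones(n)] + [X[:, j] for j in subset])
                  coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
                  rss = ((y - Z @ coef) ** 2).sum()
                  bics.append(n * np.log(rss / n) + Z.shape[1] * np.log(n))
                  full = np.zeros(p)
                  full[list(subset)] = coef[1:]
                  betas.append(full)
          w = np.exp(-(np.array(bics) - min(bics)) / 2)
          w /= w.sum()
          incl_prob = (np.array([b != 0 for b in betas]) * w[:, None]).sum(axis=0)
          beta_avg = (np.array(betas) * w[:, None]).sum(axis=0)
          return incl_prob, beta_avg

      # toy example: only the first of four regressors matters
      rng = np.random.default_rng(2)
      n = 150
      X = rng.standard_normal((n, 4))
      y = 1.5 * X[:, 0] + rng.standard_normal(n)
      incl_prob, beta_avg = bic_model_average(y, X)
      print(incl_prob)  # inclusion probability near 1 for the first regressor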
  5. By: Jens Ludwig; Sendhil Mullainathan; Jann Spiess
    Abstract: A core challenge in the analysis of experimental data is that the impact of some intervention is often not entirely captured by a single, well-defined outcome. Instead there may be a large number of outcome variables that are potentially affected and of interest. In this paper, we propose a data-driven approach rooted in machine learning to the problem of testing effects on such groups of outcome variables. It is based on two simple observations. First, the 'false-positive' problem that arises when testing effects on a group of outcomes is similar to the concern of 'over-fitting,' which has been the focus of a large literature in statistics and computer science. We can thus leverage sample-splitting methods from the machine-learning playbook that are designed to control over-fitting, ensuring that statistical models express generalizable insights about treatment effects. The second simple observation is that the question of whether treatment affects a group of variables is equivalent to the question of whether treatment can be predicted from these variables better than by some trivial benchmark (provided treatment is assigned randomly). This formulation allows us to leverage data-driven predictors from the machine-learning literature to flexibly mine for effects, rather than rely on more rigid approaches like multiple-testing corrections and pre-analysis plans. We formulate a specific methodology and present three kinds of results: first, our test is exactly sized for the null hypothesis of no effect; second, a specific version is asymptotically equivalent to a benchmark joint Wald test in a linear regression; and third, this methodology can guide inference on where an intervention has effects. Finally, we argue that our approach can naturally deal with typical features of real-world experiments, and can be adapted to baseline balance checks.
    Date: 2017–07
    URL: http://d.repec.org/n?u=RePEc:arx:papers:1707.01473&r=big
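    A stripped-down version of the "predict treatment from the outcomes" idea can be sketched as follows. This is only an illustrative approximation: it combines sample splitting with a permutation null, whereas the paper's own construction delivers an exactly sized test; names and toy data are invented:

      import numpy as np
      from sklearn.linear_model import LogisticRegression
      from sklearn.model_selection import train_test_split

      def joint_effect_test(Y, d, n_perm=999, seed=0):
          """Test whether treatment d affects any outcome in Y by asking whether
          d is predictable from Y out of sample (sample splitting controls
          over-fitting); significance is assessed against a permutation null."""
          rng = np.random.default_rng(seed)
          Y_tr, Y_te, d_tr, d_te = train_test_split(Y, d, test_size=0.5, random_state=seed)

          def heldout_accuracy(labels_tr):
              clf = LogisticRegression(max_iter=1000).fit(Y_tr, labels_tr)
              return clf.score(Y_te, d_te)

          stat = heldout_accuracy(d_tr)
          null = [heldout_accuracy(rng.permutation(d_tr)) for _ in range(n_perm)]
          p_value = (1 + sum(s >= stat for s in null)) / (1 + n_perm)
          return stat, p_value

      # toy example: 20 outcomes, treatment shifts only the first two
      rng = np.random.default_rng(1)
      n = 400
      d = rng.integers(0, 2, n)
      Y = rng.standard_normal((n, 20))
      Y[:, :2] += 0.3 * d[:, None]
      print(joint_effect_test(Y, d))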
  6. By: Pedro Carneiro; Sokbae Lee; Daniel Wilhelm
    Abstract: In a randomized control trial, the precision of an average treatment effect estimator can be improved either by collecting data on additional individuals or by collecting additional covariates that predict the outcome variable. We propose the use of pre-experimental data, such as a census or a household survey, to inform the choice of both the sample size and the covariates to be collected. Our procedure seeks to minimize the resulting average treatment effect estimator's mean squared error, subject to the researcher's budget constraint. We rely on a modification of an orthogonal greedy algorithm that is conceptually simple, easy to implement in the presence of a large number of potential covariates, and does not require any tuning parameters. In two empirical applications, we show that our procedure can lead to substantial gains of up to 58%, measured either in terms of reductions in data collection costs or in terms of improvements in the precision of the treatment effect estimator.
    Date: 2016–03
    URL: http://d.repec.org/n?u=RePEc:arx:papers:1603.03675&r=big
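    The flavour of the covariate-selection step can be illustrated with a plain greedy algorithm on pre-experimental data: repeatedly add the affordable covariate that most reduces the residual variance of the outcome until the budget is exhausted. This simplified Python sketch ignores the joint choice of sample size and the paper's orthogonalization details; costs and data are hypothetical:

      import numpy as np

      def greedy_covariate_selection(X, y, costs, budget):
          """Greedy forward selection under a budget: at each round, add the
          affordable covariate that yields the largest drop in residual variance."""
          n, p = X.shape
          selected, spent = [], 0.0
          resid = y - y.mean()
          for _ in range(p):
              best_j, best_gain = None, 0.0
              for j in range(p):
                  if j in selected or spent + costs[j] > budget:
                      continue
                  Z = np.column_stack([np.ones(n), X[:, selected + [j]]])
                  coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
                  gain = resid.var() - (y - Z @ coef).var()
                  if gain > best_gain:
                      best_j, best_gain = j, gain
              if best_j is None:
                  break
              selected.append(best_j)
              spent += costs[best_j]
              Z = np.column_stack([np.ones(n), X[:, selected]])
              coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
              resid = y - Z @ coef
          return selected, spent

      # toy example: the most predictive covariate is also the most costly
      rng = np.random.default_rng(3)
      n, p = 500, 10
      X = rng.standard_normal((n, p))
      y = 2.0 * X[:, 0] + X[:, 4] + rng.standard_normal(n)
      costs = np.ones(p)
      costs[0] = 3.0
      print(greedy_covariate_selection(X, y, costs, budget=4.0))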
  7. By: Ye Luo; Martin Spindler
    Abstract: Boosting is one of the most significant developments in machine learning. This paper studies the rate of convergence of $L_2$Boosting, which is tailored for regression, in a high-dimensional setting. Moreover, we introduce so-called "post-Boosting". This is a post-selection estimator which applies ordinary least squares to the variables selected in the first stage by $L_2$Boosting. Another variant is "Orthogonal Boosting", where after each step an orthogonal projection is conducted. We show that both post-$L_2$Boosting and orthogonal boosting achieve the same rate of convergence as LASSO in a sparse, high-dimensional setting. We also show that the rate of convergence of classical $L_2$Boosting depends on the design matrix, as described by a sparse eigenvalue constant. To establish the latter results, we derive new approximation results for the pure greedy algorithm, based on analyzing the revisiting behavior of $L_2$Boosting. We also introduce feasible rules for early stopping, which can be easily implemented and used in applied work. Our results also allow a direct comparison between LASSO and boosting, which has been missing from the literature. Finally, we present simulation studies and applications to illustrate the relevance of our theoretical results and to provide insights into the practical aspects of boosting. In these simulation studies, post-$L_2$Boosting clearly outperforms LASSO.
    Date: 2016–02
    URL: http://d.repec.org/n?u=RePEc:arx:papers:1602.08927&r=big
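    The "Orthogonal Boosting" variant mentioned in the abstract ends every step with an OLS refit on the selected covariates, which is the same refit that post-$L_2$Boosting applies after plain boosting. A minimal Python sketch follows; the fixed number of steps and the toy data are arbitrary illustrations, not the authors' feasible early-stopping rules:

      import numpy as np

      def orthogonal_boosting(X, y, n_steps=10):
          """Orthogonal L2Boosting sketch: add the covariate most correlated with
          the current residual, then re-project y onto all selected covariates."""
          n, p = X.shape
          active = []
          resid = y - y.mean()
          for _ in range(n_steps):
              scores = np.abs(X.T @ resid) / np.sqrt((X ** 2).sum(axis=0))
              scores[active] = -np.inf          # never re-select an active covariate
              active.append(int(np.argmax(scores)))
              Z = np.column_stack([np.ones(n), X[:, active]])
              coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
              resid = y - Z @ coef              # OLS refit on the active set
          return active, coef

      # toy example with p > n: the first selected columns should be 0 and 3
      rng = np.random.default_rng(4)
      n, p = 100, 500
      X = rng.standard_normal((n, p))
      y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.standard_normal(n)
      print(orthogonal_boosting(X, y, n_steps=5)[0])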

This nep-big issue is ©2017 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at http://nep.repec.org. For comments please write to the director of NEP, Marco Novarese at <director@nep.repec.org>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.