nep-ecm 2021-09-06 papers

on Econometrics

Issue of 2021‒09‒06
nineteen papers chosen by
Sune Karlsson
Örebro universitet

Wild Bootstrap for Instrumental Variables Regressions with Weak and Few Clusters By Wenjie Wang; Yichong Zhang
Revisiting Event Study Designs: Robust and Efficient Estimation By Kirill Borusyak; Xavier Jaravel; Jann Spiess
Fast cluster bootstrap methods for linear regression models By James G. MacKinnon
Robust Bayesian Analysis for Econometrics By Raffaella Giacomini; Toru Kitagawa; Matthew Read
Forward variable selection for ultra-high dimensional quantile regression models By Honda, Toshio; Lin, Chien-Tong
Extreme Conditional Expectile Estimation in Heavy-Tailed Heteroscedastic Regression Models By Stéphane Girard; Gilles Claude Stupfler; Antoine Usseglio-Carleve
Bandwidth Selection for Nonparametric Regression with Errors-in-Variables By Hao Dong; Taisuke Otsu; Luke Taylor
Biases on variances estimated on large data-sets By François Gardes
Accounting for Spatial Autocorrelation in Algorithm-Driven Hedonic Models: A Spatial Cross-Validation Approach By Juergen Deppner; Marcelo Cajias; Wolfgang Schäfers
Self-fulfilling Bandits: Endogeneity Spillover and Dynamic Selection in Algorithmic Decision-making By Jin Li; Ye Luo; Xiaowei Zhang
Assessing partial association between ordinal variables: quantification, visualization, and hypothesis testing By Liu, Dungang; Li, Shaobo; Yu, Yan; Moustaki, Irini
Exploring volatility of crude oil intra-day return curves: a functional GARCH-X Model By Rice, Gregory; Wirjanto, Tony; Zhao, Yuqian
On Extending Stochastic Dominance Comparisons to Ordinal Variables and Generalising Hammond Dominance By Gordon John Anderson; Teng Wah Leo
Dynamic relationship between Stock and Bond returns: A GAS MIDAS copula approach By Nguyen, Hoang; Javed, Farrukh
Cherry Picking By Lang, Megan; Qiu, Wenfeng
Robust PCA Synthetic Control By Mani Bayani
Evaluation of technology clubs by clustering: A cautionary note By Andres, Antonio Rodriguez; Otero, Abraham; Amavilah, Voxi Heinrich
Peeking inside the Black Box: Interpretable Machine Learning and Hedonic Rental Estimation By Marcelo Cajias; Willwersch Jonas; Lorenz Felix; Franz Fuerst
On the interpretation of black-box default prediction models: an Italian Small and Medium Enterprises case By Lisa Crosato; Caterina Liberati; Marco Repetto

Wild Bootstrap for Instrumental Variables Regressions with Weak and Few Clusters

By:	Wenjie Wang; Yichong Zhang
Abstract:	We study the wild bootstrap inference for instrumental variable (quantile) regressions in the framework of a small number of large clusters, in which the number of clusters is viewed as fixed and the number of observations for each cluster diverges to infinity. For subvector inference, we show that the wild bootstrap Wald test with or without using the cluster-robust covariance matrix controls size asymptotically up to a small error as long as the parameters of endogenous variables are strongly identified in at least one of the clusters. We further develop a wild bootstrap Anderson-Rubin (AR) test for full-vector inference and show that it controls size asymptotically up to a small error even under weak or partial identification for all clusters. We illustrate the good finite-sample performance of the new inference methods using simulations and provide an empirical application to a well-known dataset about U.S. local labor markets.
Date:	2021–08
URL:	http://d.repec.org/n?u=RePEc:arx:papers:2108.13707&r=

Revisiting Event Study Designs: Robust and Efficient Estimation

By:	Kirill Borusyak; Xavier Jaravel; Jann Spiess
Abstract:	A broad empirical literature uses "event study," or "difference-in-differences with staggered rollout," research designs for treatment effect estimation: settings in which units in the panel receive treatment at different times. We show a series of problems with conventional regression-based two-way fixed effects estimators, both static and dynamic. These problems arise when researchers conflate the identifying assumptions of parallel trends and no anticipatory effects, implicit assumptions that restrict treatment effect heterogeneity, and the specification of the estimand as a weighted average of treatment effects. We then derive the efficient estimator robust to treatment effect heterogeneity for this setting, show that it has a particularly intuitive "imputation" form when treatment-effect heterogeneity is unrestricted, characterize its asymptotic behavior, provide tools for inference, and illustrate its attractive properties in simulations. We further discuss appropriate tests for parallel trends, and show how our estimation approach extends to many settings beyond standard event studies.
Date:	2021–08
URL:	http://d.repec.org/n?u=RePEc:arx:papers:2108.12419&r=
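
The "imputation" form mentioned in the abstract above can be sketched in a few lines. This is only one reading of the idea, under assumptions that are not in the abstract: unit and period fixed effects are fitted on untreated observations only, the untreated outcome is imputed for treated observations, and the gaps are averaged with equal weights. The function name `imputation_att`, the dictionary layout and the alternating-means fitting loop are all hypothetical choices made for this illustration, and every treated unit and period is assumed to have at least one untreated observation.

```python
def imputation_att(y, treated, n_iter=2000, tol=1e-12):
    """Average treatment effect on treated cells via imputation.

    y and treated are dicts keyed by (unit, period); treated[k] is True
    when the unit is under treatment in that period.
    """
    untreated = [k for k in y if not treated[k]]
    unit_fe = {i: 0.0 for i, _ in y}
    time_fe = {t: 0.0 for _, t in y}
    by_unit, by_time = {}, {}
    for i, t in untreated:
        by_unit.setdefault(i, []).append(t)
        by_time.setdefault(t, []).append(i)

    # Step 1: fit y = unit effect + period effect on untreated cells only,
    # alternating between the two sets of means until they settle.
    for _ in range(n_iter):
        change = 0.0
        for i, ts in by_unit.items():
            new = sum(y[i, t] - time_fe[t] for t in ts) / len(ts)
            change = max(change, abs(new - unit_fe[i]))
            unit_fe[i] = new
        for t, us in by_time.items():
            new = sum(y[i, t] - unit_fe[i] for i in us) / len(us)
            change = max(change, abs(new - time_fe[t]))
            time_fe[t] = new
        if change <= tol:
            break

    # Step 2: impute the untreated outcome for every treated cell.
    # Step 3: average the gaps between observed and imputed outcomes.
    gaps = [y[i, t] - (unit_fe[i] + time_fe[t]) for (i, t) in y if treated[i, t]]
    return sum(gaps) / len(gaps)
```

Because the fixed effects are never estimated from treated observations, heterogeneous effects across units and periods do not contaminate the imputed counterfactual in this sketch.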

Fast cluster bootstrap methods for linear regression models

By:	James G. MacKinnon (Queen's University)
Abstract:	Efficient computational algorithms for bootstrapping linear regression models with clustered data are discussed. For OLS regression, a new algorithm is provided for the pairs cluster bootstrap, and two algorithms for the wild cluster bootstrap are compared. One of these is a new way to express an existing algorithm, and the other is new. For IV regression, an algorithm is provided for the wild restricted efficient cluster (WREC) bootstrap, which up to now has been computationally burdensome for large samples. These algorithms are remarkably fast because all computations are based on matrices and vectors that contain sums over the observations within each cluster, which have to be computed just once before the bootstrap loop begins. Monte Carlo experiments are used to study the finite-sample properties of bootstrap Wald tests for OLS regression and of WREC bootstrap tests for IV regression.
Keywords:	clustered data, cluster-robust variance estimator, CRVE, robust inference, wild cluster bootstrap, WCR bootstrap, pairs cluster bootstrap, wild restricted efficient cluster bootstrap, WREC bootstrap, bootstrap Wald test
JEL:	C12 C15 C21 C23
Date:	2021–09
URL:	http://d.repec.org/n?u=RePEc:qed:wpaper:1465&r=
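
The computational point in the abstract above (sums within each cluster are computed once, before the bootstrap loop) can be illustrated with a toy pairs cluster bootstrap. This is a sketch under simplifying assumptions, not the algorithm of the paper: it handles only a regression of y on an intercept and one regressor, and the function names `cluster_sums`, `ols_from_sums` and `pairs_cluster_bootstrap` are hypothetical. It also assumes the regressor varies within every resampled set of clusters, so the normal equations stay solvable.

```python
import random


def cluster_sums(x, y, cluster):
    """Accumulate per-cluster sums for y = a + b*x: n, Sx, Sxx, Sy, Sxy."""
    sums = {}
    for xi, yi, g in zip(x, y, cluster):
        s = sums.setdefault(g, [0.0, 0.0, 0.0, 0.0, 0.0])
        s[0] += 1.0
        s[1] += xi
        s[2] += xi * xi
        s[3] += yi
        s[4] += xi * yi
    return sums


def ols_from_sums(total):
    """Solve the 2x2 normal equations from aggregated sums; returns (a, b)."""
    n, sx, sxx, sy, sxy = total
    det = n * sxx - sx * sx
    b = (n * sxy - sx * sy) / det
    a = (sy - b * sx) / n
    return a, b


def pairs_cluster_bootstrap(x, y, cluster, reps=999, seed=0):
    """Return bootstrap slope estimates from resampling whole clusters."""
    rng = random.Random(seed)
    sums = cluster_sums(x, y, cluster)  # computed once, before the loop
    ids = list(sums)
    slopes = []
    for _ in range(reps):
        draw = [rng.choice(ids) for _ in ids]  # clusters drawn with replacement
        # Each replication only adds up small per-cluster vectors.
        total = [sum(sums[g][k] for g in draw) for k in range(5)]
        slopes.append(ols_from_sums(total)[1])
    return slopes
```

Inside the loop the cost depends on the number of clusters rather than the number of observations, which is the source of the speed-up the abstract describes.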

Robust Bayesian Analysis for Econometrics

By:	Raffaella Giacomini; Toru Kitagawa; Matthew Read
Abstract:	We review the literature on robust Bayesian analysis as a tool for global sensitivity analysis and for statistical decision-making under ambiguity. We discuss the methods proposed in the literature, including the different ways of constructing the set of priors that are the key input of the robust Bayesian analysis. We consider both a general set-up for Bayesian statistical decisions and inference and the special case of set-identified structural models. We provide new results that can be used to derive and compute the set of posterior moments for sensitivity analysis and to compute the optimal statistical decision under multiple priors. The paper ends with a self-contained discussion of three different approaches to robust Bayesian inference for set-identified structural vector autoregressions, including details about numerical implementation and an empirical illustration.
Keywords:	ambiguity; Bayesian robustness; statistical decision theory; identifying restrictions; multiple priors; structural vector autoregression
JEL:	C11 C18 C52
Date:	2021–08–23
URL:	http://d.repec.org/n?u=RePEc:fip:fedhwp:93001&r=

Forward variable selection for ultra-high dimensional quantile regression models

By:	Honda, Toshio; Lin, Chien-Tong
Abstract:	We propose forward variable selection procedures with a stopping rule for feature screening in ultra-high dimensional quantile regression models. For such very large models, penalized methods do not work and some preliminary feature screening is necessary. We demonstrate the desirable theoretical properties of our forward procedures by properly taking care of uniformity w.r.t. subsets of covariates. The necessity of such uniformity is often overlooked in the literature. Our stopping rule suitably incorporates the model size at each stage. We also present the results of simulation studies and a real data application to show the good finite-sample performance of the procedures.
Keywords:	forward procedure, check function, sparsity, screening consistency, stopping rule
Date:	2021–08
URL:	http://d.repec.org/n?u=RePEc:hit:econdp:2021-02&r=

Extreme Conditional Expectile Estimation in Heavy-Tailed Heteroscedastic Regression Models

By:	Stéphane Girard (LJK - Laboratoire Jean Kuntzmann - Inria - Institut National de Recherche en Informatique et en Automatique - CNRS - Centre National de la Recherche Scientifique - UGA - Université Grenoble Alpes - Grenoble INP - Institut polytechnique de Grenoble - Grenoble Institute of Technology - UGA - Université Grenoble Alpes, STATIFY - Modèles statistiques bayésiens et des valeurs extrêmes pour données structurées et de grande dimension - Inria Grenoble - Rhône-Alpes - Inria - Institut National de Recherche en Informatique et en Automatique - LJK - Laboratoire Jean Kuntzmann - Inria - Institut National de Recherche en Informatique et en Automatique - CNRS - Centre National de la Recherche Scientifique - UGA - Université Grenoble Alpes - Grenoble INP - Institut polytechnique de Grenoble - Grenoble Institute of Technology - UGA - Université Grenoble Alpes); Gilles Claude Stupfler (CREST - Centre de Recherche en Economie et Statistique [Bruz] - ENSAI - Ecole Nationale de la Statistique et de l'Analyse de l'Information [Bruz]); Antoine Usseglio-Carleve (TSE - Toulouse School of Economics - UT1 - Université Toulouse 1 Capitole - Université Fédérale Toulouse Midi-Pyrénées - EHESS - École des hautes études en sciences sociales - CNRS - Centre National de la Recherche Scientifique - INRAE - Institut National de Recherche pour l’Agriculture, l’Alimentation et l’Environnement)
Abstract:	Expectiles define a least squares analogue of quantiles. They have been the focus of a substantial quantity of research in the context of actuarial and financial risk assessment over the last decade. The behaviour and estimation of unconditional extreme expectiles using independent and identically distributed heavy-tailed observations has been investigated in a recent series of papers. We build here a general theory for the estimation of extreme conditional expectiles in heteroscedastic regression models with heavy-tailed noise; our approach is supported by general results of independent interest on residual-based extreme value estimators in heavy-tailed regression models, and is intended to cope with covariates having a large but fixed dimension. We demonstrate how our results can be applied to a wide class of important examples, among which linear models, single-index models as well as ARMA and GARCH time series models. Our estimators are showcased on a numerical simulation study and on real sets of actuarial and financial data.
Keywords:	Tail empirical process of residuals, Single-index model, Residual-based estimators, Regression models, Heteroscedasticity, Heavy-tailed distribution, Extreme value analysis, Expectiles
Date:	2021–06
URL:	http://d.repec.org/n?u=RePEc:hal:journl:hal-03306230&r=
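
To make the opening sentence of the abstract above concrete (expectiles as a least squares analogue of quantiles), the sketch below computes an ordinary sample expectile by iteratively reweighted averaging. It is not the extreme conditional estimator developed in the paper; the function name `sample_expectile` and its tolerance settings are assumptions made for this example.

```python
def sample_expectile(values, tau, tol=1e-12, max_iter=1000):
    """Return the tau-expectile of a sample.

    The tau-expectile minimises sum_i |tau - 1{x_i <= e}| * (x_i - e)^2,
    i.e. the quantile check loss with the absolute value replaced by a square.
    """
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie strictly between 0 and 1")
    xs = [float(v) for v in values]
    e = sum(xs) / len(xs)  # tau = 0.5 gives the mean, so start there
    for _ in range(max_iter):
        # Observations above the current guess get weight tau, the rest 1 - tau.
        weights = [tau if x > e else 1.0 - tau for x in xs]
        new_e = sum(w * x for w, x in zip(weights, xs)) / sum(weights)
        if abs(new_e - e) <= tol:
            return new_e
        e = new_e
    return e
```

With tau = 0.5 the weights are symmetric and the result is the sample mean, just as the 0.5-quantile is the median; values of tau close to 1 put more weight on observations above the expectile and move it into the upper tail.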

Bandwidth Selection for Nonparametric Regression with Errors-in-Variables

By:	Hao Dong (Southern Methodist University); Taisuke Otsu (London School of Economics and Political Science); Luke Taylor (Aarhus University)
Abstract:	We propose two novel bandwidth selection procedures for the nonparametric regression model with classical measurement error in the regressors. Each method is based on evaluating the prediction errors of the regression using a second (density) deconvolution. The first approach uses a typical leave-one-out cross validation criterion, while the second applies a bootstrap approach and the concept of out-of-bag prediction. We show the asymptotic validity of both procedures and compare them to the SIMEX method of Delaigle and Hall (2008) in a Monte Carlo study. As well as enjoying advantages in terms of computational cost, the methods proposed in this paper lead to lower mean integrated squared error compared to the current state-of-the-art.
Keywords:	Bandwidth selection, nonparametric regression, deconvolution, classical measurement error.
JEL:	C14
Date:	2021–08
URL:	http://d.repec.org/n?u=RePEc:smu:ecowpa:2104&r=

Biases on variances estimated on large data-sets

By:	François Gardes (CES - Centre d'économie de la Sorbonne - UP1 - Université Paris 1 Panthéon-Sorbonne - CNRS - Centre National de la Recherche Scientifique, UP1 - Université Paris 1 Panthéon-Sorbonne, PSE - Paris School of Economics - ENPC - École des Ponts ParisTech - ENS Paris - École normale supérieure - Paris - PSL - Université Paris sciences et lettres - UP1 - Université Paris 1 Panthéon-Sorbonne - CNRS - Centre National de la Recherche Scientifique - EHESS - École des hautes études en sciences sociales - INRAE - Institut National de Recherche pour l’Agriculture, l’Alimentation et l’Environnement, UCO - Université Catholique de l'Ouest)
Abstract:	The inverse dependence of estimated variances on the sample size raises a fundamental question about the validity of the usual statistical methodology, since any hypothesis on the value of a coefficient can be rejected by increasing the size of the data-set. I suppose that large data-sets are characterized by a concentration of information on homogeneous sub-populations, so that spatial autocorrelation of the error terms and the covariates may bias the estimation of variances. Using the corrections of variances under spatial autocorrelation, we obtain variances comparable to those from an estimation on sub-samples (named efficient sub-samples), the sizes of which are sufficient to contain the information which gives rise to estimates similar to those obtained on the whole population. Moreover, the estimation on efficient data-sets does not require the specification of the spatial autocorrelations which are supposed to bias the estimated variances.
Keywords:	dataset,estimated variance,spatial autocorrelation,grouped observations
Date:	2021–03
URL:	http://d.repec.org/n?u=RePEc:hal:cesptp:halshs-03325118&r=

Accounting for Spatial Autocorrelation in Algorithm-Driven Hedonic Models: A Spatial Cross-Validation Approach

By:	Juergen Deppner; Marcelo Cajias; Wolfgang Schäfers
Abstract:	Aim of research: Real estate markets have a spatial dimension that is pivotal for the economic value of housing. The inherent spatial dependence in the underlying price determination process cannot simply be overlooked in linear hedonic model specifications, as this would render spurious results (see Anselin, 1988; Can and Megbolugbe, 1997; Basu and Thibodeau, 1998). Guidance on how to account for spatial dependence in linear regression models is vast and remains the subject of many contributions to the hedonic and spatial econometric literature (see LeSage and Pace, 2009; Anselin, 2010; Elhorst, 2014). Moving from the parametric paradigm of hedonic regression methods to the universe of non-parametric statistical learning methods such as decision trees, random forests, or boosting techniques, the literature has brought forth an increasing body of evidence that such algorithms are capable of providing superior predictive performance for complex non-linear and multi-dimensional regression problems, including various applications to house price estimation (e.g. Mayer et al., 2019; Pace and Hayunga, 2020; Bogin and Shui, 2020). However, in contrast to linear models, little attention has been paid to the implications of spatial dependence in house prices for the statistical validity of error estimates of machine learning algorithms, although independence of the data is implicitly assumed (see Roberts et al., 2017; Schratz et al., 2019). Our study aims at investigating the role of spatial autocorrelation (SAC) in the accuracy assessment of algorithmic hedonic methods, thereby benchmarking spatially conscious machine learning approaches against linear and spatial hedonic methods. Study design and methodology: Machine learning algorithms learn the relationship between the response and the regressors autonomously without requiring any a-priori specifications about their functional form. As their high flexibility makes such approaches prone to overfitting, resampling strategies such as k-fold cross validation are applied to approximate a model's out-of-sample predictive performance. During resampling, the observations are randomly partitioned into mutually exclusive training and test subsets, whereby the predictor is fitted on the training data and evaluated on the test data. SAC can be accounted for using spatial resampling strategies which attempt to reduce SAC between training and test data through a modification in the splitting process. Instead of randomly partitioning the data, which implicitly assumes their independence, spatially clustered partitions are created using the observations' coordinates (see Brenning, 2012). We train and evaluate tree-based algorithms on a pooled cross-section of asking rents in Germany using both random and spatial partitioning and subsequently forecast out-of-sample data to assess the bias in the in-sample error estimates associated with SAC. The results are benchmarked against well-specified ordinary least squares and spatial autoregressive frameworks to compare the models' generalizability. Originality and implications: Applying machine learning to spatial data without accounting for SAC provides the predictor with information that is assumed to be unavailable during training, which may lead to biased accuracy assessment (see Lovelace et al., 2021). This study sheds light on the accuracy bias of random resampling induced by SAC in a hedonic context. The results prove useful for increasing the robustness and generalizability of algorithmic approaches to hedonic regression problems, thereby containing valuable implications for appraisal practices. To the best of our knowledge, no research in the existing literature has thus far accounted for SAC in an algorithm-driven hedonic context by applying spatial cross-validation. We conclude that random resampling yields over-optimistic prediction accuracies whereas spatial resampling increases generalizability, and thus robustness to unseen data. We also find the bias to be lower for algorithms which apply column-subsampling to counteract overfitting.
Keywords:	Hedonic Models; Machine Learning; Spatial Autocorrelation; Spatial Cross Validation
JEL:	R3
Date:	2021–01–01
URL:	http://d.repec.org/n?u=RePEc:arz:wpaper:eres2021_51&r=
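
The contrast between random and spatial partitioning described in the abstract above can be sketched as two fold-assignment rules. Note the substitution: the abstract mentions spatially clustered partitions built from the coordinates, whereas this sketch swaps in a simpler grid-blocking rule. The function names `random_folds` and `spatial_block_folds` and the `cell_size` parameter are hypothetical.

```python
import random


def random_folds(n, k, seed=0):
    """Assign n observations to k folds at random (implicitly assumes independence)."""
    rng = random.Random(seed)
    order = list(range(n))
    rng.shuffle(order)
    folds = [0] * n
    for rank, i in enumerate(order):
        folds[i] = rank % k
    return folds


def spatial_block_folds(coords, k, cell_size, seed=0):
    """Assign observations to k folds by grid cell, so close neighbours share a fold."""
    rng = random.Random(seed)

    def cell(point):
        x, y = point
        return (int(x // cell_size), int(y // cell_size))

    cells = sorted({cell(p) for p in coords})
    rng.shuffle(cells)  # whole cells, not single points, are spread over the folds
    cell_fold = {c: rank % k for rank, c in enumerate(cells)}
    return [cell_fold[cell(p)] for p in coords]
```

With the second rule, a test fold never contains an observation from the same grid cell as a training observation, which reduces the leakage of spatially correlated information that random partitioning allows.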

Self-fulfilling Bandits: Endogeneity Spillover and Dynamic Selection in Algorithmic Decision-making

By:	Jin Li; Ye Luo; Xiaowei Zhang
Abstract:	In this paper, we study endogeneity problems in algorithmic decision-making where data and actions are interdependent. When there are endogenous covariates in a contextual multi-armed bandit model, a novel bias (self-fulfilling bias) arises because the endogeneity of the covariates spills over to the actions. We propose a class of algorithms to correct for the bias by incorporating instrumental variables into leading online learning algorithms. These algorithms also attain regret levels that match the best known lower bound for the cases without endogeneity. To establish the theoretical properties, we develop a general technique that untangles the interdependence between data and actions.
Date:	2021–08
URL:	http://d.repec.org/n?u=RePEc:arx:papers:2108.12547&r=

Assessing partial association between ordinal variables: quantification, visualization, and hypothesis testing

By:	Liu, Dungang; Li, Shaobo; Yu, Yan; Moustaki, Irini
Abstract:	Partial association refers to the relationship between variables Y1, Y2, ..., YK while adjusting for a set of covariates X = {X1, ..., Xp}. To assess such an association when Yk’s are recorded on ordinal scales, a classical approach is to use partial correlation between the latent continuous variables. This so-called polychoric correlation is inadequate, as it requires multivariate normality and it only reflects a linear association. We propose a new framework for studying ordinal-ordinal partial association by using surrogate residuals (Liu and Zhang, JASA, 2018). We justify that conditional on X, Yk and Yl are independent if and only if their corresponding surrogate residual variables are independent. Based on this result, we develop a general measure φ to quantify association strength. As opposed to polychoric correlation, φ does not rely on normality or models with the probit link, but instead it broadly applies to models with any link functions. It can capture a non-linear or even non-monotonic association. Moreover, the measure φ gives rise to a general procedure for testing the hypothesis of partial independence. Our framework also permits visualization tools, such as partial regression plots and 3-D P-P plots, to examine the association structure, which is otherwise unfeasible for ordinal data. We stress that the whole set of tools (measures, p-values, and graphics) is developed within a single unified framework, which allows a coherent inference. The analyses of the National Election Study (K = 5) and Big Five Personality Traits (K = 50) demonstrate that our framework leads to a much fuller assessment of partial association and yields deeper insights for domain researchers.
Keywords:	covariate adjustment; multivariate analysis; partial regression plot; polychoric correlation; rating data; surrogate residual
JEL:	C1
Date:	2020–08–26
URL:	http://d.repec.org/n?u=RePEc:ehl:lserod:105558&r=

Exploring volatility of crude oil intra-day return curves: a functional GARCH-X Model

By:	Rice, Gregory; Wirjanto, Tony; Zhao, Yuqian
Abstract:	Crude oil intra-day return curves collected from the commodity futures market often appear to be serially uncorrelated and long-range dependent. Existing functional GARCH models, while able to accommodate short-range conditional heteroscedasticity, are not designed to capture long-range dependence. We propose and study a new functional GARCH-X model for this purpose, where the covariate X is chosen to be weakly stationary and long-range dependent. Functional analogs of autocorrelation coefficients of squared processes for this model are derived, and compared to those estimated from crude oil return curves. The results show that the FGARCH-X model provides a significant correction to existing functional volatility models in terms of in-sample fit, while its out-of-sample performance does not appear to be superior to that of the existing functional GARCH models.
Keywords:	Crude oil intra-day return curves, volatility modeling and forecasting, functional GARCH-X model, long-range dependence, basis selection
JEL:	C13 C32 C58 G10 G17
Date:	2021–08–18
URL:	http://d.repec.org/n?u=RePEc:pra:mprapa:109231&r=

On Extending Stochastic Dominance Comparisons to Ordinal Variables and Generalising Hammond Dominance

By:	Gordon John Anderson; Teng Wah Leo
Abstract:	Following the increasing use of discrete ordinal data for well-being analysis, this note builds on Hammond (H−) dominance concepts developed in Gravel et al. (2020) for discrete ordinal variables by observing and exploiting the fact that the coefficients associated with successive sums of cumulative distribution functions are Binomial coefficient functions of the order of dominance under consideration. Drawing first on notions of stochastic dominance relations for continuous variables to develop analogous concepts for discrete ordinal variables, it highlights the important limitation that increasing orders of dominance lead to loss of degrees of freedom which can be significant when the number of categories is low, as is common among ordered categorical variables, effectively bounding the maximum order of dominance. However, expanding on H− dominance by utilising the Binomial coefficients facilitates sequential consideration of higher orders of H− dominance without this loss, thereby surmounting the limitation.
Keywords:	Stochastic Dominance; Discrete Variables; Ordinal Variables; Hammond Transfers
JEL:	C14 I3
Date:	2021–09–01
URL:	http://d.repec.org/n?u=RePEc:tor:tecipa:tecipa-705&r=

Dynamic relationship between Stock and Bond returns: A GAS MIDAS copula approach

By:	Nguyen, Hoang (Örebro University School of Business); Javed, Farrukh (Örebro University School of Business)
Abstract:	Stocks and bonds are the two most crucial assets for portfolio allocation and risk management. This study proposes generalized autoregressive score mixed frequency data sampling (GAS MIDAS) copula models to analyze the dynamic dependence between stock returns and bond returns. A GAS MIDAS copula decomposes their relationship into a short-term dependence and a long-term dependence. While the long-term dependence is driven by related macro-finance factors using a MIDAS regression, the short-term effect follows a GAS process. Asymmetric dependence at different quantiles is also taken into account. We find that the proposed GAS MIDAS copula models are more effective in optimal portfolio allocation and improve the accuracy in risk management compared to other alternatives.
Keywords:	GAS copulas; MIDAS; asymmetry
JEL:	C32 C52 C58 G11 G12
Date:	2021–08–30
URL:	http://d.repec.org/n?u=RePEc:hhs:oruesi:2021_015&r=

Cherry Picking

By:	Lang, Megan (The Abdul Latif Jameel Poverty Action Lab); Qiu, Wenfeng
Abstract:	Measures like pre-analysis plans ask researchers to describe planned data collection and justify data exclusions, but they provide little enforceable oversight of primary data collection. We show that a simple algorithm can select large subsets of data that yield economically meaningful and statistically significant treatment effects. The subsets cannot be distinguished from a random sample of the original data, rendering the selection undetectable if peer reviewers are unaware of the size of the original dataset. Our results hold using simulated data and replication data from a well-known study. We show that there are few natural deterrents to dataset manipulation: the results in our selected subset are robust to a range of alternative specifications, our algorithm performs well under complex sampling strategies, and our subset can yield artificially high effects on multiple outcomes. We conclude by proposing a measure to prevent such manipulation in field experiments.
Date:	2021–08–24
URL:	http://d.repec.org/n?u=RePEc:osf:metaar:as9zd&r=

Robust PCA Synthetic Control

By:	Mani Bayani
Abstract:	In this study, I propose a five-step algorithm for the synthetic control method for comparative studies. My algorithm builds on the synthetic control model of Abadie et al., 2015 and the later model of Amjad et al., 2018. I apply all three methods (robust PCA synthetic control, synthetic control, and robust synthetic control) to answer the hypothetical question: what would have been the per capita GDP of West Germany if it had not reunified with East Germany in 1990? I then apply all three algorithms in two placebo studies. Finally, I check for robustness. This paper demonstrates that my method can outperform the robust synthetic control model of Amjad et al., 2018 in placebo studies and is less sensitive to the weights of synthetic members than the model of Abadie et al., 2015.
Date:	2021–08
URL:	http://d.repec.org/n?u=RePEc:arx:papers:2108.12542&r=

Evaluation of technology clubs by clustering: A cautionary note

By:	Andres, Antonio Rodriguez; Otero, Abraham; Amavilah, Voxi Heinrich
Abstract:	Applications of machine learning techniques to economic problems are increasing. These are powerful techniques with great potential to extract insights from economic data. However, care must be taken to apply them correctly, or the wrong conclusions may be drawn. In the technology clubs literature, after applying a clustering algorithm, some authors train a supervised machine learning technique, such as a decision tree or a neural network, to predict the label of the clusters. Then, they use some performance metric (typically, accuracy) of that prediction as a measure of the quality of the clustering configuration they have found. This is an error with potential negative implications for policy, because obtaining a high accuracy in such a prediction does not mean that the clustering configuration found is correct. This paper explains in detail why this modus operandi is not sound from a theoretical point of view and uses computer simulations to demonstrate it. We caution policymakers and indicate the direction for future investigations.
Keywords:	Machine learning; clustering; technological change; technology clubs; knowledge economy; cross-country
JEL:	C45 C53 O38 O57 P41
Date:	2021–05–15
URL:	http://d.repec.org/n?u=RePEc:pra:mprapa:109138&r=
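
The argument in the abstract above can be reproduced in miniature. The sketch below is not the simulation design of the paper; it is a hypothetical one-dimensional example in which uniformly distributed data, which have no cluster structure by construction, are split by 2-means and the resulting labels are then "predicted" by a one-split classifier. The function names `two_means_1d` and `stump_accuracy` are assumptions, and accuracy is measured in-sample for simplicity.

```python
import random


def two_means_1d(xs, n_iter=100):
    """Plain 2-means on a list of numbers; returns a 0/1 label per point."""
    c0, c1 = min(xs), max(xs)  # assumes the points are not all identical
    labels = []
    for _ in range(n_iter):
        labels = [0 if abs(x - c0) <= abs(x - c1) else 1 for x in xs]
        g0 = [x for x, lab in zip(xs, labels) if lab == 0]
        g1 = [x for x, lab in zip(xs, labels) if lab == 1]
        new0, new1 = sum(g0) / len(g0), sum(g1) / len(g1)
        if (new0, new1) == (c0, c1):
            break
        c0, c1 = new0, new1
    return labels


def stump_accuracy(xs, labels):
    """Fit a one-split classifier to the cluster labels; return its in-sample accuracy."""
    upper_of_0 = max(x for x, lab in zip(xs, labels) if lab == 0)
    lower_of_1 = min(x for x, lab in zip(xs, labels) if lab == 1)
    threshold = (upper_of_0 + lower_of_1) / 2.0
    hits = sum((0 if x <= threshold else 1) == lab for x, lab in zip(xs, labels))
    return hits / len(xs)


rng = random.Random(0)
# Uniform draws: by construction there are no real clusters in this data.
xs = [rng.random() for _ in range(200)]
labels = two_means_1d(xs)
accuracy = stump_accuracy(xs, labels)
```

The classifier recovers the labels perfectly because the clustering rule itself is a simple function of the input, so high accuracy here says nothing about whether two groups really exist.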

Peeking inside the Black Box: Interpretable Machine Learning and Hedonic Rental Estimation

By:	Marcelo Cajias; Willwersch Jonas; Lorenz Felix; Franz Fuerst
Abstract:	Machine Learning (ML) can detect complex relationships to solve problems in various research areas. To estimate real estate prices and rents, ML represents a promising extension to the hedonic literature since it is able to increase predictive accuracy and is more flexible than the standard regression-based hedonic approach in handling a variety of quantitative and qualitative inputs. Nevertheless, its inferential capacity is limited due to its complex non-parametric structure and the ‘black box’ nature of its operations. In recent years, research on Interpretable Machine Learning (IML) has emerged that improves the interpretability of ML applications. This paper aims to elucidate the analytical behaviour of ML methods and their predictions of residential rents applying a set of model-agnostic methods. Using a dataset of 58k apartment listings in Frankfurt am Main (Germany), we estimate rent levels with the eXtreme Gradient Boosting Algorithm (XGB). We then apply Permutation Feature Importance (PFI), Partial Dependence Plots (PDP), Individual Conditional Expectation Curve (ICE) and Accumulated Local Effects (ALE). Our results suggest that IML methods can provide valuable insights and yield higher interpretability of ‘black box’ models. According to the results of PFI, the most relevant locational variables for apartments are the proximity to bars, convenience stores and bus station hubs. Feature effects show that ML identifies non-linear relationships between rent and proximity variables. Rental prices increase up to a distance of approx. 3 kilometers to a central bus hub, followed by a steep decline. We therefore assume tenants to face a trade-off between good infrastructural accessibility and locational separation from the disamenities associated with traffic hubs such as noise and air pollution. The same holds true for proximity to bars, with rents peaking at 1 km distance. While tenants appear to appreciate nearby nightlife facilities, immediate proximity is subject to rental discounts. In summary, IML methods can increase transparency of ML models and therefore identify important patterns in rental markets. This may lead to a better understanding of residential real estate and offer new insights for researchers as well as practitioners.
Keywords:	Explainable Artificial Intelligence; housing; Machine Learning; Non parametric hedonic models
JEL:	R3
Date:	2021–01–01
URL:	http://d.repec.org/n?u=RePEc:arz:wpaper:eres2021_104&r=
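
Of the model-agnostic tools named in the abstract above, Permutation Feature Importance is the simplest to sketch. The version below is a generic illustration, not the authors' implementation: it measures the average increase in mean squared error when one feature column is shuffled. The function names, the list-of-lists data layout and the choice of mean squared error as the loss are assumptions made here.

```python
import random


def mean_squared_error(predict, X, y):
    """Mean squared error of a prediction function over rows X and targets y."""
    return sum((predict(row) - target) ** 2 for row, target in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Average increase in MSE when one feature column is shuffled.

    predict maps a row (list of feature values) to a number; X is a list of rows.
    """
    rng = random.Random(seed)
    baseline = mean_squared_error(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        total = 0.0
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the target
            shuffled = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            total += mean_squared_error(predict, shuffled, y) - baseline
        importances.append(total / n_repeats)
    return importances
```

Because the procedure only calls `predict`, it applies to any fitted model; a feature the model ignores receives an importance of exactly zero.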

On the interpretation of black-box default prediction models: an Italian Small and Medium Enterprises case

By:	Lisa Crosato; Caterina Liberati; Marco Repetto
Abstract:	Academic research and the financial industry have recently paid great attention to Machine Learning algorithms due to their power to solve complex learning tasks. In the field of firms' default prediction, however, the lack of interpretability has prevented the extensive adoption of the black-box type of models. To overcome this drawback and maintain the high performances of black-boxes, this paper relies on a model-agnostic approach. Accumulated Local Effects and Shapley values are used to shape the predictors' impact on the likelihood of default and rank them according to their contribution to the model outcome. Prediction is achieved by two Machine Learning algorithms (eXtreme Gradient Boosting and FeedForward Neural Network) compared with three standard discriminant models. Results show that our analysis of the Italian Small and Medium Enterprises manufacturing industry benefits from the overall highest classification power by the eXtreme Gradient Boosting algorithm without giving up a rich interpretation framework.
Date:	2021–08
URL:	http://d.repec.org/n?u=RePEc:arx:papers:2108.13914&r=

This nep-ecm issue is ©2021 by Sune Karlsson. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.