Detection of influential points as a byproduct of resampling-based variable selection procedures

dc.date.accessioned	2017-12-12T16:26:48Z
dc.date.available	2019-07-20T22:46:18Z
dc.date.created	2017-08-03T13:45:48Z
dc.date.issued	2017
dc.identifier.citation	De Bin, Riccardo; Boulesteix, Anne-Laure; Sauerbrei, Willi. Detection of influential points as a byproduct of resampling-based variable selection procedures. Computational Statistics & Data Analysis. 2017, 116, 19-31
dc.identifier.uri	http://hdl.handle.net/10852/59343
dc.description.abstract	Influential points can cause severe problems when deriving a multivariable regression model. A novel approach to check for such points is proposed, based on the variable inclusion matrix, a simple way to summarize results from resampling-based variable selection procedures. The variable inclusion matrix reports whether a variable (column) is included in a regression model fitted on a pseudo-sample (row) generated from the original data (e.g., bootstrap sample or subsample). It is used to study variable selection stability, to derive weights for model-averaged predictors, and in other investigations. Concentrating on variable selection, it also makes it possible to understand whether the presence of a specific observation influences the selection of a variable. Indeed, from the variable inclusion matrix, the inclusion frequency (I-frequency) of each variable can be computed using only the pseudo-samples (i.e., rows) that contain the specific observation. When the procedure is repeated for each observation, it is possible to check for influential points through the distribution of the I-frequencies, visualized in a boxplot, or through a Grubbs’ test. Outlying values in the former case and significant results in the latter point to observations that influence the selection of a specific variable and therefore the finally selected model. This novel approach is illustrated in two real data examples.	en_US
dc.language	EN
dc.publisher	Elsevier
dc.rights	Attribution-NonCommercial-NoDerivs 3.0 Unported
dc.rights.uri	https://creativecommons.org/licenses/by-nc-nd/3.0/
dc.title	Detection of influential points as a byproduct of resampling-based variable selection procedures	en_US
dc.type	Journal article	en_US
dc.creator.author	De Bin, Riccardo
dc.creator.author	Boulesteix, Anne-Laure
dc.creator.author	Sauerbrei, Willi
cristin.unitcode	185,15,13,25
cristin.unitname	Statistikk og biostatistikk
cristin.ispublished	true
cristin.fulltext	postprint
cristin.qualitycode	1
dc.identifier.cristin	1484013
dc.identifier.bibliographiccitation	info:ofi/fmt:kev:mtx:ctx&ctx_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.jtitle=Computational Statistics & Data Analysis&rft.volume=116&rft.spage=19&rft.date=2017
dc.identifier.jtitle	Computational Statistics & Data Analysis
dc.identifier.volume	116
dc.identifier.startpage	19
dc.identifier.endpage	31
dc.identifier.doi	http://dx.doi.org/10.1016/j.csda.2017.07.001
dc.identifier.urn	URN:NBN:no-62018
dc.type.document	Tidsskriftartikkel	en_US
dc.type.peerreviewed	Peer reviewed
dc.source.issn	0167-9473
dc.identifier.fulltext	Fulltext https://www.duo.uio.no/bitstream/handle/10852/59343/2/CSDA-D-16-01272R2.pdf
dc.type.version	AcceptedVersion
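
The computation described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy matrices, and the representation of pseudo-sample membership as a separate 0/1 matrix are all hypothetical. It computes, for each observation, the I-frequency of each variable over only those pseudo-samples that contain the observation, and then the Grubbs-type statistic (largest absolute deviation from the mean, divided by the sample standard deviation) for one variable. A full Grubbs' test would additionally compare this statistic with a critical value, which is omitted here.

```python
import numpy as np


def conditional_inclusion_frequencies(inclusion, membership):
    """Per-observation I-frequencies (hypothetical helper).

    inclusion:  (B, p) 0/1 matrix, 1 if variable j was selected in pseudo-sample b.
    membership: (B, n) 0/1 matrix, 1 if observation i is contained in pseudo-sample b.
    Returns an (n, p) array whose entry (i, j) is the inclusion frequency of
    variable j computed only over the pseudo-samples that contain observation i.
    """
    inclusion = np.asarray(inclusion, dtype=float)
    membership = np.asarray(membership, dtype=float)
    counts = membership.sum(axis=0)  # number of pseudo-samples containing each observation
    if np.any(counts == 0):
        raise ValueError("every observation must appear in at least one pseudo-sample")
    return (membership.T @ inclusion) / counts[:, None]


def grubbs_statistic(values):
    """Return (G, index) for the value farthest from the mean (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    sd = values.std(ddof=1)
    if sd == 0:
        raise ValueError("all values are identical; the statistic is undefined")
    deviations = np.abs(values - values.mean())
    index = int(deviations.argmax())
    return deviations[index] / sd, index


# Toy data (invented for illustration): 4 pseudo-samples, 3 observations, 2 variables.
membership = np.array([[1, 1, 0],
                       [1, 0, 1],
                       [0, 1, 1],
                       [1, 1, 0]])
inclusion = np.array([[1, 0],
                      [1, 1],
                      [0, 1],
                      [1, 0]])

freq = conditional_inclusion_frequencies(inclusion, membership)
# Check the second variable: which observation has the most unusual I-frequency?
g_stat, suspect = grubbs_statistic(freq[:, 1])
```

In this toy example each row of freq is one observation and each column one variable, so a boxplot of a column of freq would correspond to the graphical check mentioned in the abstract.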