Multiple imputation for Cox regression with sampled cohort data

Njøs, Aleksander

Master thesis

Year

2020

Abstract

In nested case-control and case-cohort studies of time-to-event outcomes, covariate information is collected for all individuals in the sampled cohort. Often information on some of the covariates is readily available for the entire cohort, while other covariates can only be collected for a limited number of individuals: those in the sampled cohort. Multiple imputation, an algorithm for handling missing data, can be used to impute ("fill in") the covariate values that have not been collected for individuals in the remaining part of the cohort, a small to moderate number of times. Cox regression estimates from each imputed dataset (cohort) can then be combined according to Rubin's rules. Multiple imputation used in this setting has previously been shown to give more efficient inferences by utilising more of the available information outside the sampled cohorts. However, in studies with very large cohorts, multiple imputation for the entire cohort might be very demanding or even infeasible. In this thesis, existing methods for multiple imputation of values missing by chance in sampled cohort studies, in their original and an adapted form, are used to impute values in a superset of the sampled cohort. Imputing values missing by design in the superset motivates estimating the regression coefficients with nested case-control or case-cohort estimators. The results from simple simulation experiments show good performance with respect to bias and efficiency. For very large cohorts, the number of controls in a nested case-control superset, or the size of the subcohort in a case-cohort superset, determines the size of the part of the cohort that is to be imputed; superset imputation therefore looks like a promising method when imputation of the entire cohort is not possible.
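
As an illustration of the pooling step mentioned in the abstract, the following is a minimal sketch of Rubin's rules for a single regression coefficient. It is not taken from the thesis: the function name `pool_rubin` and its inputs are hypothetical, and it assumes that each imputed dataset has already been analysed (for example with Cox regression) to give one point estimate and one estimated variance per dataset.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool one coefficient across M imputed datasets with Rubin's rules.

    estimates: point estimates, one per imputed dataset (hypothetical input)
    variances: estimated variances of those estimates, in the same order
    Returns (pooled_estimate, total_variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations and matching lengths")

    pooled = mean(estimates)        # average of the M point estimates
    within = mean(variances)        # average within-imputation variance
    between = variance(estimates)   # sample variance of estimates (M - 1 denominator)

    # Total variance: within + (1 + 1/M) * between
    total = within + (1 + 1 / m) * between
    return pooled, total


if __name__ == "__main__":
    # Made-up numbers purely to show the call; not results from the thesis.
    est, var = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
    print(est, var)
```

The sketch pools a single coefficient only; for a vector of coefficients the same formulas would apply with covariance matrices in place of scalar variances.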