Use of multiple imputation in supersampled nested case‐control and case‐cohort studies

Nested case‐control and case‐cohort studies are useful for studying associations between covariates and time‐to‐event when some covariates are expensive to measure. Full covariate information is collected in the nested case‐control or case‐cohort sample only, while cheaply measured covariates are often observed for the full cohort. Standard analysis of such case‐control samples ignores any full cohort data. Previous work has shown how data for the full cohort can be used efficiently by multiple imputation of the expensive covariate(s), followed by a full‐cohort analysis. For large cohorts this is computationally expensive or even infeasible. An alternative is to supplement the case‐control samples with additional controls on which cheaply measured covariates are observed. We show how multiple imputation can be used for analysis of such supersampled data. Simulations show that this brings efficiency gains relative to a traditional analysis and that the efficiency loss relative to using the full cohort data is not substantial.


INTRODUCTION
Large cohort studies are widely used in epidemiology to investigate the effect of risk factors and other covariates on the time to an event of interest, like disease diagnosis or death. Cox regression is commonly used to analyze data from such studies. Estimation for Cox's model requires information on risk factors and other covariates for all individuals in the cohort, also for rare diseases where most individuals will not experience the event of interest. To avoid the collection of expensive covariate information, such as biological measurements, for all individuals in the cohort, it may be advantageous to adopt a cohort sampling design. For these designs, information on risk factors and other covariates are recorded for all individuals who experience the event of interest ("cases"), but only for a sample of the individuals who do not experience the event ("controls"). There are two main types of cohort sampling designs: nested case-control studies and case-cohort studies. The two types of cohort sampling designs differ in the way controls are selected. For nested case-control sampling (Thomas, 1977), one for each case selects at random a small number of controls from those at risk at the case's event time, and a new sample of controls is selected for each case. For case-cohort sampling (Prentice, 1986), a subcohort is selected at random from the full cohort at the outset of the study, and the individuals in the subcohort are used as controls at all event times when they are at risk.
As described in Section 2, the traditional analyses of nested case-control and case-cohort data only use information for individuals in the case-control sample, that is, for the cases and controls/subcohort. However, while expensive covariate measurements may be available only for the case-control sample, there may be other cheaply measured covariates that are available for all individuals in the full cohort. The traditional analyses of sampled cohort data ignore this information. The sampled cohort plus the additional data available on the remainder of the full cohort may be viewed as a full cohort study with a large missing data problem, in which the expensive covariate measurements are missing by design for individuals outside the case-control sample. One approach for handling this missing data problem, is maximum likelihood estimation for the full cohort; see, for example, Scheike and Martinussen (2004), Scheike and Juul (2004), Saarela et al. (2008), and Lin (2014, 2018). The paper by Kulathinal and Arjas (2006) is also worth mentioning. They consider a Bayesian analysis of case-cohort data using the full cohort likelihood and Bayesian data augmentation. Another popular approach for handling missing data problems, is multiple imputation (MI), which essentially is an approximation to a full Bayesian analysis. MI has good frequentist properties, and it is easier to implement than a full Bayesian analysis. In the context of Cox regression for sampled cohort data, MI has been studied by Keogh and White (2013) and Keogh et al. (2018); see also the review by Keogh (2018).
In studies with very large cohorts, maximum likelihood estimation or the use of MI for the entire cohort may be computationally very demanding or even infeasible. Focusing on MI, we in this paper investigate a middle way between a traditional sampled cohort study and a sampled cohort study with MI for the full cohort. The idea is to select a supersample of the sampled cohort by adding more controls to a nested case-control study and enlarging the subcohort for a case-cohort study. The expensive covariate measurements are imputed for the individuals in the supersample who are not in the original case-control sample, but not for the other individuals in the full cohort. The data for the imputed supersample may then be analyzed using the traditional methods for analysing sampled cohort data.
The outline of the paper is as follows. In Section 2 we briefly describe Cox's regression model for the full cohort, and review the traditional ways of analyzing sampled cohort data using this model. MI in Cox regression with cohort data and sampled cohort data is reviewed in Section 3, and in Section 4 we describe how MI may be modified when imputation is only performed for individuals in the supersample. In Section 5 we present a simulation study that illustrates how MI for the supersample compares with MI for the full cohort and the traditional methods for analysing sampled cohort data. Finally, in Section 6, we discuss our results and point out some directions for further research. Some simulation results that supplement those of Section 5 are given as Supporting Information on the journal's web page.

TRADITIONAL ANALYSIS OF SAMPLED COHORT DATA
We start out with a review of the traditional ways of analyzing sampled cohort data. For ease of presentation, we assume that there is a single covariate X that is expensive to measure, while Z is a vector of cheaply measured covariates. The situation with more than one expensive covariate is briefly discussed in Section 6. The hazard rate for an individual with covariate values X = x and Z = z is denoted h(t | x, z). The time variable t may be age, time since the onset of a disease, or some other time scale relevant to the problem at hand. We assume that the covariates are related to the hazard rate by Cox's proportional hazards model Here and are regression coefficients that describe the effects of the covariates on the hazard, while the baseline hazard h 0 (t) corresponds to the hazard of an individual with all covariates equal to zero. We consider a cohort  = {1, 2,…, N} of N independent individuals, and suppose for a moment that both X and Z are observed for the full cohort. The values of (X, Z) for individual i are denoted (x i , z i ). We do not observe the event of interest for all individuals. Due to censoring, we for each individual i ∈  only observe (T i , D i ), where T i is the minimum of an event time and a censoring time, and D i = 1 if the event is observed, and D i = 0 otherwise. The values of T i and D i are denoted t i and d i . We assume throughout that the censoring and event times for individual i are independent given X i , Z i , and that the censoring time and X i are independent given Z i . The latter assumption is not required for the methods in this section, but it is needed when using MI in Sections 3 and 4.
Then for the full cohort, the regression coefficients and in (1) are estimated by the valueŝ and̂that maximize Cox's partial likelihood Here (t) = {j | t j ≥ t} is the risk set at time t, and  = {j | d j = 1} is the set of all cases. Further the Breslow estimator of the cumulative baseline hazard H 0 (t) = ∫ t 0 h 0 (u)du may be given aŝ H 0,coh (t;̂,̂), whereĤ It is well known that the maximum partial likelihood estimators for the full cohort have similar properties as ordinary maximum likelihood estimators (Andersen & Gill, 1982). In particular, SEs of the parameter estimates may be obtained from the observed information matrix. As described below, nested case-control and case-cohort data are traditionally analyzed using modified versions of Cox's partial likelihood and the Breslow estimator.

Nested case-control studies
The controls in a nested case-control study are selected as follows. If the event of interest is observed for an individual i at time t i , one selects at random a small number m of controls from (t i ) ⧵ {i}, that is, the risk set with the case excluded. The set of size m + 1 that consists of the case and the sampled controls is denoted a sampled risk set and denoted(t i ). The expensive covariate is measured for the individuals in the sampled risk sets, but it is not needed for the other individuals in the full cohort. Traditionally, estimation of the regression coefficients in a nested case-control study is based on the partial likelihood Note that (4) is similar to the full cohort partial likelihood (2), except that the sum in the denominator is only over individuals in a sampled risk set. For computing one may use standard software for Cox regression, like coxph in the survival library in R, formally treating the label of the sampled risk sets as a stratification variable in the Cox regression. The cumulative baseline hazard may be estimated by a Breslow-type estimator given aŝ H 0,ncc (t;̂,̂), whereĤ and̂and̂are obtained by maximizing (4). In (5), the weights are w(t i ) = N(t i )∕(m + 1), where N(t) = |(t)| is the number of individuals at risk at time t. The maximum partial likelihood estimatorŝand̂obtained by maximizing (4) enjoy similar large sample properties as ordinary maximum likelihood estimators Goldstein & Langholz, 1992).

Case-cohort studies
For the case-cohort design, a subcohort of size n is selected at random from the full cohort at the outset of the study. The individuals in the subcohort are used as controls at all event times when they are at risk. The expensive covariate is measured for all individuals in the subcohort as well as for cases occurring outside the subcohort, but it is not needed for the other individuals in the full cohort. Different methods have been suggested for estimating the regression coefficients for case-cohort data. The methods are based on weighted pseudo-likelihoods of the form but the methods differ in the choice of weights w j and sampled risk sets(t i ). For Prentice's original pseudo-likelihood, all weights are one and( is the set of all subcohort individuals at risk at time t i . Self and Prentice (1988) considered the modification where(t i ) = (t i ) also when the case is not in the subcohort.
In Prentice's pseudo-likelihood, a case outside the subcohort only makes a contribution at its event time. In order to make use of the information from the cases at all times when they are at risk, we may adopt an inverse probability weighted (IPW) pseudo-likelihood (Kalbfleisch & Lawless, 1988). Then we let(t i ) = {j | t j ≥ t i , j ∈ ∪ }, where  is the set of all cases. So now (t i ) consists of all subcohort individuals at risk at time t i together with all cases who are at risk at that time. The weights are given as w j = 1∕p j , where p j is the probability that individual j is included in the case-control sample. The cases are included with probability one, so the weights are w j = 1 for all cases (whether they are in the subcohort or not). We let N (0) and n (0) be the number of individuals in the cohort and subcohort, respectively, who do not experience the event of interest. Then an individual who is not a case, is included in the subcohort with probability p j = n (0) ∕N (0) , and hence may be given the weight w j = N (0) ∕n (0) .
Pseudo-likelihoods of the form (6) are not partial likelihoods, so the maximum pseudo-likelihood estimators do not enjoy similar large sample properties as ordinary maximum likelihood estimators. In particular, SEs cannot be computed directly from the observed pseudo-information matrix. But it has been shown that the maximum pseudo-likelihood estimators are approximately multivariate normally distributed, and estimators for their variance-covariance matrices have been worked out; see, for example, Self and Prentice (1988), Therneau and Li (1999), Borgan et al. (2000), and Samuelsen et al. (2007). The estimators based on (6) may be computed using the cch-command in the survival package in R. Here method="Prentice," method="SelfPrentice," and method = "LinYing" give the three estimators mentioned above.
For case-cohort data we may estimate the cumulative baseline hazard by a Breslow-type estimator given asĤ 0,cch (t;̂,̂), wherê and̂and̂are obtained by maximizing (6). For the IPW estimator, the sampled risk sets(t i ) and weights in (7) are as given above, sow j = 1 for cases andw j = N (0) ∕n (0) for noncases. For the Prentice estimator, we may usew j = N∕n for all individuals in the subcohort and let(t i ) = (t i ) (Prentice, 1986, p. 6).

MI IN COX REGRESSION
In this section we outline methods for MI of missing covariate data in Cox regression in analysis of cohort data, and their extension to sampled cohort data in which an expensive covariate is missing for everyone outside the case-control sample. We assume data are missing at random (MAR), which can be expressed as the assumption that the probability of missingness in a given variable is independent of missing data conditional on observed data (Rubin, 1987;Carpenter & Kenward, 2013, chapter 1).

MI in full cohort studies
We refer to the book of Carpenter and Kenward (2013), for example, for an overview of the theory of MI, and here focus on its use when the analysis model is a Cox regression. In brief, the MI procedure involves imputing missing values in covariates by taking a random draw from the estimated posterior distribution of the missing data given the observed data (including the outcome). This is repeated K times to give K "complete" imputed data sets in which missing values in covariates are filled in. The analysis model, in our case a Cox regression, is then fitted to each of the K imputed data sets to give K estimates of the regression coefficients. Pooled estimates and a corresponding variance-covariance matrix are then obtained using "Rubin's rules" (e.g., Carpenter & Kenward, 2013, pp. 45-46). First consider a full cohort study in which a single covariate X has missing data, and Z denotes a vector of fully observed covariates. It is assumed that outcome information (T, D) is observed for all individuals in the cohort. The aim in MI is therefore to impute missing values of X from the conditional distribution of X given Z = z, T = t and D = d. It has been shown that when the analysis model is a Cox regression, as in (1), this conditional distribution is not of any standard form, such as a normal distribution, which presents a challenge for obtaining imputations. To address this, two main approaches have been described for imputation of missing data on covariates in Cox regression.
The first is that of White and Royston (2009) who derived an approximate form for the imputation model, which is a regression of X on z, d, andĤ(t), whereĤ(t) denotes the Nelson-Aalen estimate of the cumulative hazard at the individual's event or censoring time t. We refer to this approach as MI-Approx.
The second approach, described by Bartlett et al. (2015), uses rejection sampling and an iterative procedure to impute missing values of X from the conditional distribution that is compatible with the outcome (or "substantive") model, which here is a Cox proportional hazards model. The basis of the method of Bartlett et al. (2015) is that we, for each iteration, draw a potential value for X from a "proposal distribution," and then use a "rejection rule" to decide whether to accept the potential value as a draw from the distribution of interest, that is, the conditional distribution of X given Z = z, T = t and D = d. In a given iteration, the rejection rule is to accept a potential value x * as an imputed value for a missing value of X if Here U denotes a random draw from a standard uniform distribution, * , * denote posterior draws of the model parameters, obtained after fitting the Cox regression analysis model to the current imputed data, and H * 0 (t) =Ĥ 0,coh (t; * , * ) denotes the cumulative baseline hazard obtained using these parameter draws; compare formula (3). This approach is referred to as the "substantive model compatible full conditional specification" (SMC-FCS) method. Following earlier work, we refer to it as MI-SMC.
MI-Approx involves specification of the model for each covariate with missingness given the other covariates and the outcome, which in our context is f (X|Z, T, D). For continuous X a conditional normal distribution is often assumed as an approximation. This approximation is derived from assuming X is normally distributed conditional on Z, in the case of continuous X. MI-SMC requires specification of the model f (X|Z), which typically takes a more standard form than f (X|Z, T, D). For a continuous X it is common to assumes that f (X|Z) is a normal distribution. Because MI-SMC accommodates nonlinear terms in the Cox regression model, it is possible under this approach to instead specify a distribution for a transformation g(X) (e.g. g(X) = log X) given Z, with the imputations of g(X) then being back-transformed for use in the Cox regression including X and Z. We refer to Bartlett et al. (2015) for a detailed discussion of compatibility of models used in MI and imputation model mis-specification, including extensions to the situation in which there is more than one covariate with missingness.
For overviews of the MI-Approx and MI-SMC methods in the content of Cox regression, see Carpenter & Kenward (2013, chapter 8), Keogh et al. (2018), and Keogh (2018). The MI-Approx method has been found to perform well in a range of circumstances, in particular for rare events. But it does not apply when there are nonlinear terms involving the missing covariate X in the proportional hazards model, including interactions between X and Z. MI-SMC accommodates nonlinear terms (including interactions) because it obtains draws of X in such as way that they are drawn from a distribution that is compatible with the proportional hazards model. MI-approx can be implemented using the mice package in R and MI-SMC using the smcfcs package.

MI in sampled cohort studies
Nested case-control and case-cohort studies can be viewed as full cohort studies with data missing for the expensive covariate X. The methods for MI in a full cohort study outlined in the preceding Section 3.1 can be applied directly to impute missing values of X for individuals in the full cohort who were not sampled to the nested case-control or case-cohort study. The use of MI in this way for both nested case-control and case-cohort studies was described by Keogh and White (2013), who assessed the two imputation approaches using simulation studies. There are two main ways in which the use of MI in this setting differs from its use in a more standard missing data setting. Firstly, for sampled cohort data the expensive covariate is typically unmeasured for a very large proportion of the full cohort. This approach therefore involves imputing a large proportion of values for the expensive covariate, and one might expect that the consequences of mis-specifying the imputation model could be severe in this situation. Secondly, in the standard MI setting, the assumption that data are MAR is a crucial assumption that we cannot be certain is met. However, the MAR assumption for sampled cohort data follows by the design of the studies, and in this sense the use of MI in this situation is "safer" than in the standard setting.

SUPERSAMPLING OF SAMPLED COHORTS
We now describe more in detail how we obtain a supersample for nested case-control and case-cohort data, and how we may use MI to impute missing covariate values for the supersample.
In general, a supersampled nested case-control study is a standard nested case-control study augmented with additional controls for each case, and a supersampled case-cohort study is a standard case-cohort study augmented with an additional random subcohort. In both types of study, the expensive covariate X is observed only for individuals in the original case-control sample, whereas Z, T, and D are observed for all individuals in the supersample. The analysis makes use of the individuals in the supersample, but it does not use the remaining individuals in the full cohort. We start out by considering a supersampled case-cohort study since imputation is more straightforward for this design than for the nested case-control design.

Supersampling for case-cohort data
For a case-cohort study, we have a subcohort of size n selected by simple random sampling from the full cohort . The expensive covariate is measured for individuals in the case-control sample ∪ {j | d j = 1}, but not for the remaining individuals in the cohort. We now select another random sample  * of size n * from  ⧵, that is, from the individuals in the cohort who are not in the original subcohort. In this way we obtain a supersampled case-cohort study with subcohort  ∪  * of size n + n * . Note that for the supersampled case-cohort data, the expensive covariate X is not measured for individuals in  * ∩ {j | d j = 0}, that is, for individuals in the supersample who are not in the original case-cohort sample. The inexpensive covariate, Z, is observed for all individuals in the supersample. Because X is missing for individuals in  * ∩ {j | d j = 0} we use MI to impute it. In Section 3.2 we outlined how a case-cohort study within a full cohort may be viewed as a full cohort study with missing data, and imputation methods for full cohort data can therefore be used. The supersampled case-cohort data is not a full cohort study, and therefore a modified approach is needed to perform the imputation. Keogh et al. (2018) have discussed how the MI-Approx and MI-SMC methods can be used for imputing missing data on covariates within a case-cohort sample, assuming missingness is at random. The supersampled case-cohort study represents a similar situation, except that the data on X are missing by design and therefore the missingness in X is at random. Hence we may use the imputation methods described by Keogh et al. (2018) as summarized in the following paragraphs.
In MI-Approx the imputation model is a regression of X on z, d, andĤ(t). HereĤ(t) is the Nelson-Aalen estimate of the cumulative hazard at the individual's event or censoring time t. Since the Nelson-Aalen estimate only depends on the T i 's and D i 's, it may be computed from the available data for the full cohort, and it is not computationally intensive to obtain for large cohorts. The MI-Approx method can then be applied directly in a supersampled case-cohort study, by fitting the imputation model using all individuals in the supersampled data.
The MI-SMC approach requires an estimate H * 0 (t) of the cumulative baseline hazard at each person's event or censoring time; compare formula (8). As outlined in Section 2.2, the cumulative baseline hazard can be estimated from a case-cohort study usingĤ 0,cch (t; , ) as given in equation (7). This estimator can also be applied to the supersampled case-cohort data. For the MI-SMC approach we then use H * 0 (t) =Ĥ 0,cch (t; * , * ) computed at each iteration of the MI-SMC procedure using posterior draws * , * of the model parameters and the most recent set of imputed values of X. The MI-SMC approach also involves estimating the distribution of X|Z, as draws of X from this distribution are used as potential imputed values in the rejection sampling. The parameters of the distribution of X|Z can be estimated from the supersampled subcohort, which is a random sample from the full cohort.
After obtaining imputed values of X for individuals in the supersample using MI-Approx or MI-SMC, a standard case-cohort analysis is applied to each imputed data set and the parameter estimates and their standard errors combined using Rubin's rules.

Supersampling for nested case-control data
In a nested case-control study, we for each i ∈  select at random m controls from (t i ) ⧵ {i}.
The sampled risk set(t i ) consists of the case i and its sampled controls. The expensive covariate is measured for individuals in the case-control sample ∪ i∈ (t i ), but not for the remaining individuals in the cohort. For each i ∈  we now add more controls by selecting at random m * individuals from (t i ) ⧵(t i ). The set of these additional controls is denoted  * (t i ). In this way we obtain a supersampled nested case-control study with sampled risk sets(t i ) ∪  * (t i ) of size m + m * . Note that for the supersampled nested case-control data, the expensive covariate is not measured for individuals in (∪ i∈  * (t i )) ∩ {j | d j = 0}, that is, for individuals in the supersample who are not in the original case-control sample. As described in Section 3.1, the aim of MI for cohort data is to impute a missing value of X from the conditional distribution of X given the observed data Z = z, T = t and D = d, and this may be achieved using the exact or an approximate conditional distribution. As the extended subcohort ∪  * for supersampled case-cohort data is a random sample from the full cohort, this also applies (with d = 0) for supersampled case-cohort data (Section 4.1). However, the controls in a supersampled nested case-control study are not a random sample from the full cohort. Rather, if an event occurs at time s, all individuals with T > s are potential controls, including individuals who later become a case. Therefore, in a supersampled nested case-control study, the missing value of X for a control at time s should be imputed from the conditional distribution of X given Z = z and T > s.
To see how this imputation model relates to the one for the full cohort and supersampled case-cohort data, we may for an individual with T > s introduce T (s) = s and D (s) = 0, corresponding to the censored event time and event indicator we would have if the individual had been censored at time s. Then the imputation model for a control at time s may be given as the conditional distribution of X given Z = z, T (s) = s and D (s) = 0. This is similar to the imputation model for the full cohort and supersampled case-cohort data, but we condition on T (s) = s and D (s) = 0 rather than T = t and D = d. Thus, for MI-Approx, we should use the Nelson-Aalen estimate evaluated at the time of the case to which a control is matched, and not the control's own censoring time.
To apply MI-SMC in a supersampled nested case-control study, we can use the cumulative baseline hazard estimator (5). To obtain imputations for control individuals it is appropriate to use the estimate H * 0 (s) =Ĥ 0,ncc (s; * , * ) obtained at the event time s of the case to which the control is matched. We also require estimates of the parameters of the distribution of X|Z, which can be obtained from a regression of X on Z in the controls.
Some individuals can feature as controls for more than one case, and in both MI-approx and MI-SMC we then obtain a new imputed value for each duplicate. After obtaining imputed values of X for individuals in the supersample, a standard nested case-control analysis is applied to each imputed data set and the parameter estimates and their standard errors combined using Rubin's rules.

Simulation aims and implementation
We use a simulation study to illustrate the methods described above and to assess their performance. The simulation study is structured following the guidance by Morris et al. (2019).

Aims
The aims of the simulation study are to illustrate the proposed MI methods for analysis of supersampled nested case-control and case-cohort studies. Specifically we wish to check that the methods give approximately unbiased estimates of the parameters of interest where expected, and to investigate the efficiency gains from the use of supersampled data compared with the original study data. We also wish to check that the standard errors obtained by Rubin's rules are approximately unbiased, and that the estimates have good coverage.

Data-generating mechanisms
Data are generated for a full cohort of N individuals. Three covariates, (X, Z 1 , Z 2 ), are generated for each individual. X is the expensive covariate, which we will later assume is observed only in the nested case-control or case-cohort study, and (Z 1 , Z 2 ) are cheaply measured covariates. We generated Z 1 from a standard normal distribution, and independently Z 2 from a Bernoulli distribution with probability 0.5. The expensive covariate X was generated by where , independent of Z 1 and Z 2 , is a random variable with mean 0 and variance 1. The distribution of was assumed to follow a normal distribution or a log-normal distribution (shifted to have mean 0). The latter was generated as = e Y − e 2 ∕2 , where Y is N(0, 2 )-distributed with = 0.694. Event times T E were generated from the Weibull hazard model We used shape parameter = 4 and values of the scale parameter E as described below, and considered a scenario with no interaction between X and Z 1 using X = 1, Z1 = 0.5, Z2 = 1, XZ1 = 0, and a scenario with an interaction between X and Z 1 using X = 1, Z1 = 0.5, Z2 = 1, XZ1 = 0.5. The scenario with interaction was only considered when in (9) was normally distributed. Censoring times T C were generated from a Weibull hazard model using shape and scale parameter values = 4 and C , and independent of the covariates. Administrative censoring was imposed at time 15 years. The observed time for each individual is therefore T = min(T E , T C , 15) and D denotes the event indicator. We consider two sample sizes for the cohort of N = 5000 and N = 25, 000. For all six scenarios (two sample sizes, two distributions of X without interaction with Z 1 , one distribution of X with interaction with Z 1 ), the values of E and C were chosen to give approximately 250 cases in the cohort (5% of individuals having the event when N = 5000 and 1% having the event when N = 25,000) and approximately 35% of the individuals administratively censored at time 15 years. Values used for E and C are shown in Table S1 in the Supporting Information.
A nested case-control sample was obtained within the full cohort using all cases and one control per case. A case-cohort sample was obtained by including all cases and a random subcohort of 5% of the individuals when N = 5000 and 1% when N = 25, 000, resulting in the subcohort being approximately the same size as the number of controls in the nested case-control sample.
For both study designs we obtained a small superset sample and a large superset sample. For the nested case-control study we obtained a small superset sample by selecting three additional controls for each case, giving four controls per case in total. For the large superset sample we selected an additional 11 controls for each case, giving 12 controls per case in total. For the case-cohort study we obtained a small superset sample by adding an additional random subcohort of 15% of the individuals when N = 5000 and an additional random subcohort of 3% of the individuals when N = 25, 000, both of which correspond to a four-fold increase in the size of the subcohort. The large superset sample was obtained by increasing the size of the subcohort 12 fold, corresponding to an additional 55% of individuals when N = 5000 and an additional 11% of individuals when N = 25, 000.
In each simulated cohort, the covariate X was set to be missing for individuals not in the original nested case-control or case-cohort data sets, except for when a full-cohort analysis is being performed.

Target of analysis
The estimands of interest for the simulation study are the log hazard ratios (log HRs) X , Z1 , Z2 , and XZ1 (in scenarios with the interaction), and their SEs.

Methods
For each simulated full cohort we performed the following analyses: (i) a full cohort analysis assuming X, Z 1 , Z 2 are fully observed on all individuals; (ii) a traditional analysis of the nested case-control study and of the case-cohort study, excluding individuals outside the case-control sample; (iii) an analysis of the nested case-control or case-cohort sample in addition to the rest of the cohort, assuming X is observed only in the nested case-control or case-cohort sample and imputing X for individuals in the remainder of the full cohort using the methods described in Section 3.2; (iv) an analysis of the supersampled nested case-control or case-cohort study using a small superset, assuming X is missing for the individuals in the superset who are not in the original nested case-control or case-cohort study, and using the imputation methods described in Section 4; (v) as in (iv) but using the large superset.
In (iii) (Z 1 , Z 2 ) are fully observed for individuals in the full cohort who are not in the nested case-control or case-cohort study. In (iv) and (v) (Z 1 , Z 2 ) are fully observed in the supersampled nested case-control or case-cohort study. In all MI analyses we use 10 imputed data sets. In the MI-SMC analyses we used 500 iterations, and found that using a lower number (200 iterations for example) could lead to bias; in particular for situation (iii) with cohort size N = 25, 000. In the case-cohort study analyses we obtained estimates using both the Prentice and IPW estimators. When the Prentice estimator was used in the analysis, we also used the Prentice method to obtain the cumulative baseline hazard estimates in MI-SMC, and similarly for the IPW estimator; see the final paragraph of Section 2.2.

Performance measures
We generated n sim = 1000 data sets for each of the six full cohort scenarios. In each analysis we obtain estimates of the log HRs X , Z1 , Z2 , and XZ1 (in scenarios including the interaction term), and their model-based SEs and 95% confidence intervals (CIs). For each log HR we obtain an estimate of the bias n −1 sim ∑ n sim j=1 (̂j − ) and the empirical SD ('Emp SE') √ (n sim − 1) −1 ∑ n sim j=1 (̂j −̄) 2 , wherêj denotes the estimate from the jth simulated data set, denotes the true value, and denotes the average of the n sim estimates. We also obtained the average of the model-based estimates of the SEs ("Model SE") of the log HR estimates, and the 95% coverage, meaning the percentage of the n sim 95% CIs that contain the true value. A well-performing method would have bias close to zero, Emp SE similar to Model SE, and coverage close to the nominal level of 95%. To ease the comparison of the analysis methods, we for each method also report the root mean squared error ("RMSE") and the relative efficiency ("Rel Eff") of the log HR estimates relative to a full cohort analysis. RMSE is given as the square root of the sum of the squared bias and Emp SE 2 , and it provides information about the bias-variance trade-off for a method. Rel Eff is defined as the empirical variance from a full cohort analysis (Emp SE 2 ) divided by the empirical variance from a given analysis method. For methods that are approximately unbiased, Rel Eff informs us how a method compares to a full cohort analysis. For each performance measure we estimated the Monte Carlo SEs (Morris et al., 2019).

Software
The simulation was conducted in R and code enabling the simulation to be replicated are provided at https://github.com/ruthkeogh/supersampling. We used the survival package, including the cch function for analysis of case-cohort studies. The MI for the MI-Approx method was implemented using mice. For the MI-SMC imputation approach we used a modified version of the smcfcs function. For use with a case-cohort study (standard or supersampled) modifications were made to smcfcs to allow an analysis using the Prentice or IPW estimators. For use with a nested case-control study (standard or supersampled) modifications were made to smcfcs so that potential imputations of X considered in the rejection sampling are obtained from a regression of X on (Z 1 , Z 2 ) using data from the controls (which includes some cases) instead of noncases. A further modification is that the cumulative baseline hazard estimate used in the imputation procedure is obtained at the event time of the case in each individual's matched set, rather than at their own censoring time. The modified version of smcfcs also restricts the function to those components required for the analyses performed in this simulation, which can help others to more easily follow the code to gain a better understanding of the analysis.

Simulation results
Tables 1 and 2 show the results for the full cohort of size 25,000 for the nested case-control study for the situations without the X × Z 1 interaction and with the X × Z 1 interaction when in (9) is normally distributed. The results when there is no interaction and follows a lognormal distribution (shifted to have mean zero) are given in Table 3. Note that for the situations of Tables 1 and 2, the imputation model is correctly specified, while it is mis-specified for the situation of Table 3. Tables 4-6 show the corresponding results for the case-cohort setting using a full cohort of size 25,000. In the main text we show results from case-cohort analyses obtained using the IPW estimator. Results using the Prentice estimator were also obtained and are shown in Tables S2-S4. Results for the full cohort of size 5000 are shown in Tables S5-S7 for nested case-control and Tables S8-S10 for case-cohort Table numbers starting with S are given in the Supporting Information. As we expect, the full cohort analysis gives unbiased log HR estimates, correct SEs (comparing Emp SE with Model SE) and coverage at the nominal level.

Nested case-control results
We first consider the situation without interaction (Table 1) when the imputation model is correctly specified. Here a traditional nested case-control analysis gives a minor bias in the estimates for all parameters. But the root mean square errors are very close to the empirical SEs, so the bias is of little importance compared to the variability of the estimates. The averages of the model-based SEs ('Model SE') are close to the empirical SDs of the estimates ('Emp SE'), and the method has coverage around 95%. The relative efficiency of the log HR estimates relative to the full cohort estimates is somewhat below 20% for̂X and̂Z 1 and somewhat below 30% for̂Z 2 . Method (iii), which uses the full cohort with X imputed for the individuals who are not in the nested case-control sample, gives unbiased log HR estimates, correct SEs and fairly good coverage. The results are very similar using MI-Approx and MI-SMC. Method (iii) results in substantial gains in efficiency relative to the traditional analysis; the relative efficiencies are about 30% for̂X and about 70% for̂Z 1 and̂Z 2 . Thus, as known from earlier work, the gain in efficiency is largest for the coefficients of Z 1 and Z 2 which are assumed observed in the full cohort.
The methods using a supersampled nested case-control study perform well using both MI-Approx and MI-SMC, giving approximately unbiased log HR estimates both for a small and a large superset. The SEs are mainly correct and the coverage is close to 95%. An exception is the MI-SMC estimate of X , where the empirical SE is somewhat larger than the model-based SE and the coverage is 90%.
As may be expected, the efficiencies for the supersampled nested case-control studies relative to a full cohort analysis are between the relative efficiencies for the traditional nested case-control analysis and method (iii) with imputation for the full cohort. More specifically, we find that MI-Approx with a small/large superset was 79%/82% efficient for estimation of X relative to method (iii). For Z1 the corresponding relative efficiencies were 57%/76%, and for Z2 they were 80%/92%. This shows that a large part of the information contained in the full cohort may be extracted by using a supersample. Not surprisingly, using a large superset typically results in smaller SEs and higher relative efficiencies, especially for the coefficients of Z 1 and Z 2 , though the gains in efficiency are not substantial. Again an exception is the MI-SMC estimate of X , where the empirical SE is slightly larger for the large superset than the small superset. We then consider the situation with interaction (Table 2). Here a traditional nested case-control analysis gives somewhat biased estimates for all parameters, except for the interaction parameter XZ1 . But the root mean square errors are fairly close to the empirical SEs, so the biases are clearly of less importance than the variability of the estimates. The model SEs tend to be slightly too small, but coverage is close to 95%. The relative efficiency of the log HR estimates relative to the full cohort estimates are all between 8% and 17%. As noted in the last paragraph of Section 3.1, MI-approx does not apply for the interaction model. This is the reason why this imputation method gives substantial bias in parameter estimates (for the full cohort and the supersamples) for all parameters except Z2 . But MI-SMC performs well for the interaction model. Method (iii) with MI-SMC imputation for the full cohort gives unbiased estimates, correct SEs and good coverage. Further, there is a huge gain in efficiency relative to the traditional analysis; the relative efficiencies are around 60% for̂X and̂Z 2 , and around 90% for̂Z 1 and̂X Z1 .
The supersampled nested case-control studies perform fairly well for the interaction model when using MI-SMC. There is a small bias in the estimates of X , Z1 and XZ1 , and the model SEs tend to be slightly too large for the small superset. The coverage is fairly close to 95% for all parameters, but a bit higher for the small than the large superset. The relative efficiencies for the small superset are about 30% for̂X and̂Z 2 , 40% for̂Z 1 , and 20% for̂X Z1 . For the large superset the corresponding values are about 40%, 50%, and 25%. These relative efficiencies are clearly lower than those of method (iii). But the relative efficiencies for the small superset are about 2.5 times higher than those of the traditional nested case-control analysis, while they are at least three times higher for the large superset. So there is a lot to gain by using a supersampled nested case-control study compared to a traditional analysis.
Finally, we consider the situation where there is no interaction, but the imputation model is mis-specified (Table 3). Here a full cohort analysis gives similar results as for the correctly specified imputation model, except that the estimate of X has clearly lower SE than in Table 1. Also a traditional nested case-control analysis gives similar results as for the correctly specified imputation model, but the estimates are a bit more biased and the SEs are somewhat larger than in Table 1. Imputation using MI-Approx gives substantial bias for all parameter estimates both when X is imputed for the full cohort (method (iii)) and when imputation is restricted to a supersample (methods (iv) and (v)). Method (iii) with MI-SMC imputation for the full cohort also gives a substantial bias. But method (iv), which uses MI-SMC imputation with a small superset, gives almost unbiased estimates, correct SEs, and coverage close to 95%. Further, the relative efficiencies of method (iv) are more than twice those of the traditional nested case-control analysis. MI-SMC imputation with the large superset (method (v)) gives somewhat larger biases, and the model SE for the estimate of X is too low, which results in poor coverage for this parameter.
For the situation without interaction, the results for cohort size 5000 (Table S5) are quite similar to those obtained with cohort size 25,000. For the situation with interaction, the broad picture for cohort size 5000 (Table S6) is similar to that for cohort size 25,000. But the traditional method has slightly less bias for cohort size 5000 and most SEs are a bit smaller. The relative efficiencies of the supersampling methods using MI-SMC are similar or larger for cohort size 5000 compared to cohort size 25,000. Finally, when the imputation model is mis-specified, the results for cohort size 5000 (Table S7) are broadly speaking similar to those obtained with cohort size 25,000. But using MI-SMC with a large superset results in a larger bias for estimation of X and a smaller bias for estimation of Z2 .

Case-cohort results
We first consider the situation without interaction (Table 4). Here the traditional IPW estimator for case-cohort data gives a fairly large bias in the estimate of X and some bias in the estimate of Z1 . The model SEs tend to be too small, and coverage is too low, in particular for X and Z1 . Method (iii), which uses the full cohort with X imputed for the individuals who are not in the case-cohort sample, gives result which are close to those obtained with method (iii) for nested case-control data (Table 1). Thus, both for MI-Approx and MI-SMC, we obtain unbiased log HR estimates, correct SEs and good coverage. Also the relative efficiencies are close to those obtained with method (iii) for nested case-control data.
The methods using supersampled case-cohort data give approximately unbiased estimates for all parameters. For estimation of Z1 and Z2 , the empirical SEs are fairly close to those TA B L E 4 Case-cohort with inverse probability weighted estimator: Results for scenario with full cohort size N = 25, 000 and no interaction between X and Z 1 obtained for the supersampled nested case-control studies. But the model SEs are slightly larger, and as a consequence of this the coverages are a bit higher than 95%. For estimation of X , the empirical SEs are higher for MI-SMC than MI-Approx, but the model SEs are quite similar.
For the situation with interaction (Table 5), the traditional IPW estimator gives a fairly large bias in the estimates of X and Z1 , and the model SEs are too small for all estimates. As for nested case-control studies, MI-Approx gives substantial bias for the interaction model. But MI-SMC performs well, and for method (iii) with MI-SMC imputation for the full cohort we obtain results TA B L E 5 Case-cohort with inverse probability weighted estimator: Results for scenario with full cohort size N = 25000 and interaction between X and Z 1 that are close to those obtained with nested case-control data. For supersampled case-cohort data using MI-SMC, the model SEs tend to be somewhat too large and coverage too high. The empirical SEs are about the same as those obtained with supersampled nested case-control data, except for estimation of the interaction parameter XZ1 where the empirical SEs for supersampled nested case-control data are about 50% larger than for supersampled case-cohort data.
The results when there is no interaction, but the imputation model is mis-specified (Table 6), are fairly similar to those for nested case-control sampling. But the classical method for case-cohort has lower SEs for estimation of X than nested case-control, and the supersampled case-cohort data give a substantial bias in the estimate of Z2 both for a small and a large superset.
For cohort size 5000 and no interaction (Table S8), the traditional IPW estimator typically gives less biased estimates with smaller SEs compared to the results for cohort size 25,000. However, the estimates for the supersampling methods tend to be slightly more biased when N = 5000 than for N = 25000. But also for these estimates, the SEs tend to be somewhat smaller for N = 5000 than for N = 25, 000. For the situation with interaction (Table S9), both the traditional estimates and the supersampling estimates tend to be somewhat less biased and have smaller SEs for cohort size 5000 than for cohorts size 25,000. The relative efficiencies of the supersampling methods using MI-SMC are somewhat larger for cohort size 5000 compared to cohort size 25,000.
Tables S2-S4 show results from case-cohort analyses using the Prentice estimator. These show that typically the traditional Prentice estimator gives less bias, but larger SEs than the IPW estimator. For supersampled data there is no systematic difference in the biases of the two methods, but the IPW estimator gives clearly lower SEs than Prentice's estimator.

DISCUSSION
In this paper we have proposed a supersampling approach for nested case-control and case-cohort studies, in which a nested case-control or case-cohort sample is supplemented with additional controls. An expensive covariate is assumed to be measured only on the original nested case-control or case-cohort sample, but cheaply measured covariates are measured for all individuals in the supersample. We outlined a method for analysis of supersampled data, which involves imputing missing data on the expensive covariate for the individuals in the supersample who are not in the original nested case-control or case-cohort sample. Our focus was on two MI methods, one using an approximate imputation model (MI-Approx) and one that uses rejection sampling to obtain draws of the missing covariate from a posterior distribution that is compatible with the Cox proportional hazards model used for the main analysis (MI-SMC). Both imputation approaches were originally proposed to enable imputation of missing covariate data in Cox regression analyses based on full cohorts under a MAR assumption (Bartlett et al., 2015;White & Royston, 2009), before being modified for use in nested case-control and case-cohort samples with missing data . The supersampled nested case-control or case-cohort study can be considered as a nested case-control or case-cohort study with a larger number of controls and with missing data on the expensive covariate. Hence the MI-Approx and MI-SMC methods described by Keogh et al. (2018) can be applied. However, when imputing in a supersampled nested case-control study, one should use the time of the case to which a control is matched and not the control's own censoring time.
We used a simulation study to assess the proposed imputation methods for analysis of supersampled nested case-control and case-cohort studies, focusing on a large underlying cohort with a low event rate, which is realistic for situations in which these study designs are likely to be used. Small and large supersets were considered, in which the number of controls per case (nested case-control study) or the size of the subcohort (case-cohort study) was increased four fold or 12 fold compared with the original sample. We compared analyses of supersampled data using MI-Approx and MI-SMC with a full cohort analysis, with a traditional analysis of the original nested case-control or case-cohort data, and with an MI analysis of the nested case-control or case-cohort data where the expensive covariate is imputed for all individuals in the cohort who are not in the original nested case-control or case-cohort data sample. The supersampled data sets with small and large supersets can be viewed as lying in between the extremes of using the original sample with no information from the rest of the cohort, and using all available information from the rest of the cohort.
In the absence of an interaction term in the analysis model the MI-Approx and MI-SMC analyses of the supersampled studies both perform well when the imputation model is correctly specified, giving approximately unbiased estimates of association and approximately correct SEs and coverage. This was true for both nested case-control and case-cohort studies. The good performance of MI-Approx in these situations could be expected, since we consider a situation with a low event rate. We showed substantial gains in efficiency in the supersample analyses compared with the traditional analyses of nested case-control and case-cohort data. These gains were greater for the fully observed covariates compared to the covariate missing by design and imputed for the individuals in the supersample who were not in the original nested case-control or case-cohort sample.
Our simulation results for a correctly specified imputation model further showed that when the analysis model includes an interaction between the covariate with missingness and another covariate, MI-Approx gives biased estimates while MI-SMC gives approximately unbiased estimates. This was expected and has been shown in other studies using the two methods. Theory shows the MI-Approx method does not work well in combination with nonlinear terms in the analysis model, whereas MI-SMC handles this situation because the imputations are drawn from a distribution that is compatible with the analysis model. MI-Approx also gives biased estimates in the absence of an interaction term when the imputation model is mis-specified. For this situation also MI-SMC imputation for the full cohort gives a substantial bias. This illustrates that the consequences of mis-specifying the imputation model may be severe when imputing the X-values for a large cohort. But MI-SMC imputation for supersampled data gives quite good results, in particular for nested case-control for the small superset. Thus one advantage of supersampling compared with using information from the full cohort, is that the impact of a mis-specified imputation model will be less severe as a lower proportion of the data is imputed.
When the distribution of X given Z is skewed, as is the case for our mis-specified imputation model scenario, in additional investigations we explored whether the performance of MI-SMC imputation may be improved by assuming log X |Z to be normally distributed when performing the imputations, which approximately matches the data generating distribution for X. Then, as shown in Tables S11 and S12, results from MI-SMC imputation are improved for full cohort imputation method for both nested case-control and case-cohort studies. However, the estimates obtained for the supersampled data when using this model in the MI-SMC imputation tend to have more bias. We note that the true distribution of X|Z differs in the full cohort and in the supersampled data. These findings demonstrate the importance of checking that the assumed imputation model is approximately correctly specified.
The MI-SMC approach required a large number of iterations, and it is considerably more computationally intensive than MI-Approx; see Table S13 for an example of computing times for the various methods. Thus computational efficiency is an advantage of the MI-Approx method when there are no interactions (or other nonlinear terms). We used 500 iterations for the MI-SMC method, which is considerably larger than the standard 10 iterations that are typically used in MI in a general context. Nevertheless, small biases from the MI-SMC approach may be attributed to minor lack of convergence. It would be advantageous if the software could be extended to incorporate straightforward ways of assessing convergence in MI-SMC. When using MI-SMC we recommend using imputed values from MI-Approx as starting values for the iterative procedure.
In our description of the methods and in the simulation study we focused on a setting in which there is a single expensive covariate, X, that is observed only in the original nested case-control or case-cohort sample. The imputation methods used therefore only needed to impute a single covariate. In general there may be more than one covariate that can only be observed on the original sample. Both imputation methods used (MI-Approx and MI-SMC) extend easily to allow several covariates to be imputed, and both do this using the chained equations approach (Carpenter & Kenward, 2013;White et al., 2011).
There are a number of further investigations that would be of interest to consider in future work. We did not consider auxiliary variables that are correlated with the expensive missing covariate, but which are not of independent interest as covariates in the analysis model. Inclusion of such auxiliary variables in the imputation model has previously been found to improve efficiency of estimates of the coefficient for the expensive covariate, and would be of interest to consider in further work in the supersampling context. Further, we have only considered the classical versions of the nested case-control and case-cohort designs, where the controls/subcohort are selected by simple random sampling (Prentice, 1986;Thomas, 1977). Langholz and Borgan (1995) and Borgan et al. (2000) have developed stratified versions of the cohort sampling designs that make it possible to incorporate information on auxillary variabes into the sampling process in order to obtain more informative samples of controls/subcohort. One then classifies the cohort individuals into a number of strata according to their values of the auxiliary variables, and selects the controls/subcohort by stratified random sampling. In this way, one may increase the variation of the expensive covariate in the case-control sample, and thereby achieve a more efficient estimation of the effect of this covariate. It would be of interest to study the use of MI for supersampled cohort data, where the original sample and/or the supersample is selected by stratified random sampling.
To enable efficient use of data in supersampled nested case-control and case-cohort studies we focused on the use of MI. Alternative methods of analysis for supersampled studies would be of interest to consider. The supersampling approach could be viewed as a two-stage design in which a nested case-control or case-cohort sample is obtained (Phase I sample) and then certain information is only obtained in a Phase II sample, and one possible alternative analysis approach would be to use inverse probability weighting.
In summary, use of supersampling in nested case-control and case-cohort studies, combined with an MI analysis, can enable gains in efficiency over traditional analyses of these sampled cohort studies, which ignore information available on individuals in the cohort but not in the case-control sample. In many situations the cohort underlying a nested case-control or case-cohort sample is very large. Making use of data for the full cohort on cheaply measured covariates using MI may be extremely computationally expensive or even infeasible in this case. It may also result in bias if the imputation model is mis-specified. Supersampling provides a practical alternative which combines computational feasibility and robustness for mis-specified imputation models with good efficiency relative to using the full cohort information.