##### Abstract

Sampling techniques such as case-cohort and nested case-control studies allow us to analyze survival data under the relative risk model assumption without having to collect data for the full cohort. This can reduce the cost of a large-scale study substantially. Simulations investigating the performance of parameter estimators in such studies have been carried out extensively. Most conclude that the procedures offer great alternatives to a cohort study, using considerably less data while suffering only small losses in efficiency (Samuelsen et al., 2007; Borgan and Samuelsen, 2003; Self and Prentice, 1988; Langholz and Borgan, 1995). In nested case-control studies, controls are sampled from the risk set at the event times and tied to the specific case. Weighted pseudo-likelihood (WPL) estimators use the data from such a sampling, but in a more efficient way, which leads to more precision in the estimated parameters. If, in addition, some further information, a surrogate measure, is available for the entire cohort, it may be exploited in a stratified sampling to achieve a more informative set of controls. This can lead to increased precision for the estimated parameters (Borgan et al., 2000).

The estimated parameters are further used when we need to calculate cumulative hazard rates or the estimated survival function. Thus, if one method produces parameter estimates that are substantially better than those of another procedure (lower variance, bias), the corresponding baseline estimator should inherit some of these merits. However, there could be other influencing factors as well, leading to differences between the baseline estimators. While the estimated parameters are found by maximizing an overall likelihood, the estimator for the cumulative hazard is a function of time t, calculated at each observed event time. Such estimators may, for certain values of t, be more sensitive to the distribution of the controls over the time period. Thus an estimator may have good properties in some region of the observation time but be lacking in other areas when compared to others. A third aspect that could have an impact on the baseline estimator is how data are used in the estimation at different t-values. More specifically, the traditional estimator in a nested case-control study will, in early regions of the time period, use far less data than the alternative methods which pool the controls together. The latter will have a baseline estimator similar to that of the case-cohort study, which will from the first observed event time use information from every case and sampled control. This could mean a more stable estimator early on, but may imply more variance being accumulated at later event times, leading to a drop in precision towards the end of the time period.
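For orientation, the baseline estimators compared above can be written in one common weighted form. The display below is only a sketch in notation that this abstract does not itself define: the event times t_j, the set R̃_j of subjects used at t_j, the covariate vector x_i, the estimated regression parameter β̂, and the weights w_i(t_j) are all symbols assumed here for illustration.

```latex
\hat{A}_0(t) \;=\; \sum_{t_j \le t}
  \frac{1}{\sum_{i \in \tilde{\mathcal{R}}_j} w_i(t_j)\,\exp\!\big(\hat{\beta}^{\top} x_i\big)}
```

Under these assumptions, the traditional nested case-control estimator takes R̃_j to be the case at t_j together with its m − 1 sampled controls, with w_i(t_j) = n(t_j)/m, where n(t_j) is the number at risk in the full cohort. The pooled alternatives instead let R̃_j contain every case and sampled control at risk at t_j, each weighted by the inverse of its inclusion probability. This is why the pooled estimators draw on more data at early event times.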

In this thesis, the properties of estimators of the cumulative baseline hazard, commonly referred to as Breslow-type estimators, are studied under various circumstances. The main goal is to establish that the estimators work and that we are able to obtain a correct estimate of the variance. It is then important to investigate which of the methods yields the more stable estimate. Since this may change throughout the study period, 'more precise' may depend on where the focus lies. A further concern is whether these estimators inherit the merits of the corresponding parameter estimators. The methods will be studied using simulations, in addition to an application to a real dataset.

In the simulations we conducted, none of the estimators has any large systematic error. The precision of the estimators is also somewhat similar; no method is substantially better than the others, but the WPL alternatives are generally slightly more efficient than the standard nested case-control estimator. In some settings, however, certain estimators have an edge. For data that are left truncated, the case-cohort estimator is less precise than the rest, especially for later event times. A similar phenomenon is seen in simulations where the hazard is very low at the beginning and increases during the study period. Experiments with the stratified estimators show that we are able to estimate the baseline even with such sampling schemes. The stratification helps achieve a more efficient estimator for β, which translates into a more efficient baseline estimator. With more than one parameter, the stratification improves the β-estimate for the covariate linked to the strata, but the second parameter estimate is less precise. The corresponding stratified baseline estimators still perform better. This is not the case in the application, where only modest gains are seen for one parameter, and these are offset by greater efficiency losses for the rest. The result is baseline estimators that are less precise than the non-stratified methods.