Data Integration in Penalized Regression Models : with application to genomics

Bergersen, Linn Cecilie

Master thesis


Year

2009

Abstract

New challenges within the statistical sciences have arisen with the explosive growth of information. Classical methods were not designed for problems of this kind and may be impossible to apply or may not behave as expected.

In regression analysis, a very large number of explanatory variables p combined with a small sample size n violates the assumption of the usual regression model that p≤n. Many novel and effective strategies have been established to circumvent this problem, and shrinkage methods are one commonly used approach for regression with p>n, or even p>>n.

As the volume of existing data expands, an increased interest in data integration has also arisen. Methods that combine information from different data sources could be of great relevance and importance in many scientific fields. One area where high-dimensional data frequently occur is biology and medicine. Large high-dimensional data sets with thousands of covariates are a result of the great advances in biotechnology, where new methods make it possible to conduct high-throughput experiments measuring gene expression and other biological features of interest. The underlying aim in analyzing these data is to search for novel biomarkers that can be used to predict the outcome of a disease for future patients.

Incorporating more than one type of such high-dimensional biological data in a single model may therefore be appropriate. Because it takes advantage of known underlying biological processes, using more of the available information is as beneficial from a biological point of view as from a statistical perspective.

The aim of this thesis is to propose a model for data integration of high-dimensional data in a regression setting where p>n. The suggested method will be a shrinkage method with L1-penalties of the lasso type. By introducing penalty terms which could be uniquely defined for each covariate, the model may provide different amounts of shrinkage to the regression coefficients based on external information from additional data sources.
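The idea of covariate-specific L1-penalties can be illustrated with a minimal sketch. It is not the thesis's own implementation: it assumes ordinary linear regression with squared-error loss (the thesis works with survival data, so its actual model may differ), and the names weighted_lasso, lam and weights are hypothetical. The sketch minimizes (1/(2n))·||y − Xβ||² + lam·Σ_j weights[j]·|β_j| by cyclic coordinate descent, so a covariate with a larger weight, for example one with little support in an external data source, is shrunk harder.

```python
def soft_threshold(z, t):
    """Soft-thresholding operator: sign(z) * max(|z| - t, 0)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def weighted_lasso(X, y, lam, weights, n_iter=200):
    """Cyclic coordinate descent for a lasso with per-covariate penalties.

    X is a list of n rows with p entries each, y a list of n responses.
    weights[j] scales the common penalty lam for covariate j (assumed to
    come from external information; hypothetical in this sketch).
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X*beta, since beta starts at zero
    for _ in range(n_iter):
        for j in range(p):
            col = [X[i][j] for i in range(n)]
            z = sum(v * v for v in col) / n
            if z == 0.0:
                continue  # an all-zero column carries no information
            # Covariance of column j with the partial residual,
            # i.e. the residual with covariate j's contribution added back.
            rho = sum(col[i] * (residual[i] + col[i] * beta[j])
                      for i in range(n)) / n
            new_beta = soft_threshold(rho, lam * weights[j]) / z
            delta = new_beta - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= col[i] * delta
                beta[j] = new_beta
    return beta


# Toy p > n example: 2 samples, 3 covariates; the third is penalized hardest.
X_toy = [[1.0, 0.5, 0.2],
         [0.0, 1.0, 0.9]]
y_toy = [1.0, 2.0]
beta_toy = weighted_lasso(X_toy, y_toy, lam=0.1, weights=[1.0, 1.0, 5.0])
```

Setting all weights to 1 recovers the ordinary lasso; how the weights should be derived from a second data source is the modelling question the thesis addresses and is left open here.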

The model will be presented in a biological context and applied to a high-dimensional data set, The Radium Hospital Cervix Cancer Cohort Data. The data set includes survival data together with both gene expression measurements and aCGH data for patients diagnosed with cervical cancer at the Norwegian Radium Hospital in the period 2001-2004. The intent is to identify genes that are important for survival and to study the possibility of predicting the outcome for future patients.