Integrating Biological Domain Knowledge in Machine Learning Models for Cancer Precision Medicine

Kielland, Anders

Master thesis

Year

2023

Appears in the following Collection

Matematisk institutt [1064]

Abstract

Cancer is a highly complex and heterogeneous disease, so treatment should ideally be tailored to the individual patient. A vast body of knowledge exists on cancer biology, diagnosis, and treatment, and a large number of measurements can readily be made on each patient. A key challenge is to use this information to design the most precise treatments. This thesis addresses the problem by analyzing data from a clinical trial on breast cancer treatment. The trial investigated hormone therapy combined with a targeted drug that specifically inhibits CDK4/6, a protein involved in estrogen-stimulated cell proliferation. The trial included 49 patients, with expression levels measured for 771 genes. The outcomes were two continuous scores intended to quantify cancer cell proliferation and long-term prognosis. We compared various machine learning models, both alone and in combination with biological domain knowledge, to assess their predictive power for cancer treatment outcomes. Furthermore, we evaluated the integration of machine learning models with a mechanistic mathematical model characterizing the mechanisms of action of the targeted drug. Finally, we explored the use of domain knowledge in a novel modeling approach. Among the standard model classes (ridge regression, lasso, elastic net, and boosting with stumps as base learners), ridge demonstrated the best predictive performance. Feature selection revealed high overlap between lasso and elastic net, while the features selected by boosting overlapped by approximately half with those of the two linear models. The integration of mechanistic and machine learning models did not improve upon the standard models.
To leverage biological knowledge, the gene set was divided into smaller subsets based on each gene's involvement in different aspects of breast cancer biology, such as regulation of cell proliferation, estrogen signaling, immune system activity, and DNA repair mechanisms. Each subset was reduced by principal component analysis, and the resulting components were used as covariates in the standard machine learning models. This led to a slight improvement in predictive power and offered some insight into the importance of different aspects of breast cancer biology. We also included interaction terms between principal components from different gene sets, which further improved predictive performance. In a second attempt to utilize biological knowledge, we employed a stacking-like approach: models were first trained on the gene sets individually, and their predictions, each representing one gene set, were then used as input features for a new machine learning model. This method did not outperform the best standard model. Lastly, inspired by the potential of modeling interactions between functional units of cancer biology, we attempted a novel iterative approach focusing on these interactions. This method showed promising results on simulated data with more observations than features but faced challenges when the number of observations became too small.
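
The gene-set feature engineering described above can be illustrated with a minimal sketch. It is not the thesis implementation. It assumes random placeholder data with the trial's dimensions (49 patients, 771 genes), a hypothetical even split of the genes into four sets named after the biological themes mentioned, two principal components per set, and a plain closed-form ridge fit with an arbitrary penalty. All function names are illustrative.

```python
import numpy as np


def pca_scores(X, n_components):
    """Return the first n_components principal component scores of X."""
    Xc = X - X.mean(axis=0)  # center each gene
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_components] * S[:n_components]


def gene_set_features(X, gene_sets, n_components):
    """Run PCA within each gene set and concatenate the component scores."""
    blocks = [pca_scores(X[:, idx], n_components) for idx in gene_sets.values()]
    return np.hstack(blocks)


def add_interactions(Z, n_sets, n_components):
    """Append products of components that come from different gene sets."""
    cols = [Z]
    for a in range(n_sets):
        for b in range(a + 1, n_sets):
            for i in range(n_components):
                for j in range(n_components):
                    za = Z[:, a * n_components + i]
                    zb = Z[:, b * n_components + j]
                    cols.append((za * zb)[:, None])
    return np.hstack(cols)


def ridge_fit(Z, y, alpha):
    """Closed-form ridge regression with an unpenalized intercept."""
    z_mean, y_mean = Z.mean(axis=0), y.mean()
    Zc = Z - z_mean
    gram = Zc.T @ Zc + alpha * np.eye(Z.shape[1])
    beta = np.linalg.solve(gram, Zc.T @ (y - y_mean))
    return beta, y_mean - z_mean @ beta


# Placeholder data only: random values, not the trial measurements.
rng = np.random.default_rng(0)
X = rng.normal(size=(49, 771))
y = rng.normal(size=49)

# Hypothetical partition of the 771 genes into four themed sets.
names = ["proliferation", "estrogen_signaling", "immune", "dna_repair"]
gene_sets = dict(zip(names, np.array_split(np.arange(771), len(names))))

n_components = 2  # assumed number of components kept per gene set
Z = gene_set_features(X, gene_sets, n_components)
Z_int = add_interactions(Z, len(names), n_components)

beta, intercept = ridge_fit(Z_int, y, alpha=1.0)  # alpha chosen arbitrarily
y_hat = Z_int @ beta + intercept
```

In this sketch the principal components are computed on all observations at once. For an honest estimate of predictive performance, the PCA and the ridge penalty would need to be fitted inside each cross-validation training fold, which matters with as few as 49 patients.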