Machine learning-driven integration of multimodal data for deciphering breast cancer heterogeneity

Liu, Qian
Breast cancer (BC) is a complex disease with a high degree of heterogeneity. This heterogeneity can be detected at different biological levels using a variety of modern molecular biology techniques. These techniques generate high-throughput, quantitative measurements such as gene expression, copy number variation (CNV), DNA methylation, and proteomics data. The tumor morphology information obtained from medical images is also worth considering when evaluating BC heterogeneity. Many machine-learning algorithms have been developed to explore cancer heterogeneity from these high-dimensional measurements. However, characterizing BC heterogeneity from multi-modal biodata with existing computational data analysis techniques poses several challenges. The first is how to effectively combine the multi-modal biodata and find comprehensive, interpretable representations of them. Another is how to address the execution infeasibility caused by the unpaired data problem: publicly available datasets have unmatched multi-omics, medical image, and clinical outcome data. In addition, model interpretability and privacy issues must be carefully considered in machine learning-based BC research. This thesis aims to explore BC heterogeneity using modern machine-learning algorithms at different data resolutions, ranging from single genomics and multi-genomics to proteomics and radiogenomics. We have four major objectives: 1) human epidermal growth factor receptor 2-positive/estrogen receptor-positive (HER2+/ER+) BC stratification and prognostic gene signature identification using single genomic data; 2) BC subtyping using multiple genomics data; 3) graph neural network (GNN)-based mapping of BC hierarchical biological systems using graph-structured proteomics data; 4) BC prognostic radiogenomic biomarker identification.
This thesis demonstrates the promising applications of machine learning in deciphering BC heterogeneity at different biological levels. Moreover, the resulting 15-gene HER2+/ER+ BC gene expression signature, multi-omics-based BC subtypes, hierarchical biological systems/protein communities, and prognostic radiogenomic biomarkers have the potential to benefit clinical practice for BC.
Machine learning, Breast cancer, Medical imaging, Biomarker, Subtyping, Multi-omics