Sparse Bayesian learning for predicting phenotypes and identifying influential markers

Ayat, Maryam
In bioinformatics, Genomic Selection (GS) and Genome-Wide Association Studies (GWASs) are two related problems with applications in the plant breeding industry. GS is a method for predicting phenotypes (i.e., traits) such as yield and disease resistance in crops from high-density markers positioned throughout the genome of the varieties. By contrast, a GWAS involves identifying the markers or genes that underlie the phenotypes of importance in breeding. The need to accelerate the development of improved varieties, together with challenges such as discovering the full range of genetic factors related to a trait, is increasingly leading researchers to apply state-of-the-art machine learning methods to GS and GWASs. The aim of this study is to employ sparse Bayesian learning as a technique for GS and GWAS. Sparse Bayesian learning uses Bayesian inference to obtain sparse solutions in regression or classification problems. This learning method is also called the Relevance Vector Machine (RVM), as it can be viewed as a kernel-based model of identical functional form to the renowned Support Vector Machine (SVM). The RVM has some advantages that the SVM lacks, such as probabilistic outputs, a much sparser model, and the ability to work with arbitrary kernel functions. Despite these advantages, the applicability of the RVM has not been researched enough. In this thesis, we define and explore two different forms of sparse Bayesian learning, one for predicting phenotypes and one for identifying the most influential markers of a trait. In particular, we introduce a new framework, based on sparse Bayesian learning and an ensemble technique, for ranking the influential markers of a trait.
We apply our methods to three datasets, one simulated and two real-world (yeast and flax). We analyze our results with respect to existing related work, trait heritability, and the accuracies obtained with different kernel functions (linear, Gaussian, and string kernels, where applicable). We find that RVMs not only perform as well as other successful machine learning methods in phenotype prediction, but are also capable of identifying the most important markers, from which biologists might gain insight.
Keywords: Sparse Bayesian Learning, Relevance Vector Machine, Phenotype prediction, Ranking features, Marker identification
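
To make the idea of sparse Bayesian learning concrete, the following is a minimal sketch of the standard iterative hyperparameter re-estimation for a linear model, applied to a toy marker matrix. It is an illustration only and not the implementation used in the thesis: the function name `sparse_bayes_fit`, the synthetic genotypes coded 0/1/2, the two "causal" marker indices, and all numeric settings are assumptions made for this example. Each weight receives its own prior precision `alpha[i]`; weights whose precision grows very large are effectively switched off, which is what produces sparsity. Ranking markers by the absolute posterior mean is one simple choice made here, not necessarily the ranking scheme of the thesis.

```python
import numpy as np

def sparse_bayes_fit(Phi, t, n_iter=200, alpha_max=1e9):
    """Hypothetical helper: sparse Bayesian linear regression via
    iterative re-estimation of the hyperparameters alpha and beta."""
    n, m = Phi.shape
    alpha = np.ones(m)        # one prior precision per weight (per marker)
    beta = 1.0 / np.var(t)    # noise precision, crude initial guess
    for _ in range(n_iter):
        # Gaussian posterior over the weights for the current hyperparameters
        Sigma = np.linalg.inv(np.diag(alpha) + beta * Phi.T @ Phi)
        mu = beta * Sigma @ Phi.T @ t
        # gamma[i] in [0, 1]: how strongly the data determine weight i
        gamma = np.clip(1.0 - alpha * np.diag(Sigma), 0.0, 1.0)
        # Re-estimate precisions; the clipping bounds are a numerical safeguard
        alpha = np.clip(gamma / (mu ** 2 + 1e-12), 1e-6, alpha_max)
        resid = t - Phi @ mu
        beta = (n - gamma.sum()) / (resid @ resid + 1e-12)
    return mu, alpha, beta

# Synthetic example (assumed data, not from the thesis):
# 200 individuals, 20 markers, only markers 3 and 11 affect the trait.
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 20)).astype(float)
w_true = np.zeros(20)
w_true[3], w_true[11] = 2.0, -1.5
y = X @ w_true + rng.normal(0.0, 0.5, size=200)

# Center markers and phenotype so that no intercept term is needed
Phi = X - X.mean(axis=0)
t = y - y.mean()

mu, alpha, beta = sparse_bayes_fit(Phi, t)
ranking = np.argsort(-np.abs(mu))   # markers ordered by |posterior mean weight|
print("top markers:", ranking[:2])
print("markers with large precision (effectively pruned):", int((alpha > 1e6).sum()))
```

With a linear kernel, as here, each basis function corresponds directly to a marker, so the surviving weights can be read as marker relevances. With a Gaussian or string kernel the basis functions would instead correspond to training individuals (the "relevance vectors"), and the design matrix `Phi` would be replaced by a kernel matrix.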