Deep clustering of small molecules at large-scale via variational autoencoder embedding and K-means

Hadipour, Hamid; Liu, Chengyou; Davis, Rebecca; Cardona, Silvia T.; Hu, Pingzhao

dc.contributor.author	Hadipour, Hamid
dc.contributor.author	Liu, Chengyou
dc.contributor.author	Davis, Rebecca
dc.contributor.author	Cardona, Silvia T.
dc.contributor.author	Hu, Pingzhao
dc.date.accessioned	2022-05-01T03:20:35Z
dc.date.issued	2022-04-15
dc.date.updated	2022-05-01T03:20:35Z
dc.description.abstract	Background: Converting molecules into computer-interpretable features with rich molecular information is a core problem of data-driven machine learning applications in chemical and drug-related tasks. Generally speaking, there are global and local features to represent a given molecule. As most algorithms have been developed based on one type of feature, a remaining bottleneck is to combine both feature sets for advanced molecule-based machine learning analysis. Here, we explored a novel analytical framework to make embeddings of the molecular features and apply them in the clustering of a large number of small molecules. Results: In this novel framework, we first introduced a principal component analysis method encoding the molecule-specific atom and bond information. We then used a variational autoencoder (AE)-based method to make embeddings of the global chemical properties and the local atom and bond features. Next, using the embeddings from the encoded local and global features, we implemented and compared several unsupervised clustering algorithms to group the molecule-specific embeddings. The number of clusters was treated as a hyper-parameter and determined by the Silhouette method. Finally, we evaluated the corresponding results using three internal indices. Applying the analysis framework to a large chemical library of more than 47,000 molecules, we successfully identified 50 molecular clusters using the K-means method with 32 embeddings based on the AE method. We visualized the clustering result via t-SNE for the overall distribution of molecules and the similarity maps for the structural analysis of randomly selected cluster-specific molecules. Conclusions: This study developed a novel analytical framework that comprises a feature engineering scheme for molecule-specific atomic and bonding features and a deep learning-based embedding strategy for different molecular features. By applying the identified embeddings, we show their usefulness for clustering a large molecule dataset. Our novel analytic algorithms can be applied to any virtual library of chemical compounds with diverse molecular structures. Hence, these tools have the potential of optimizing drug discovery, as they can decrease the number of compounds to be screened in any drug screening campaign.
dc.identifier.citation	BMC Bioinformatics. 2022 Apr 15;23(Suppl 4):132
dc.identifier.uri	https://doi.org/10.1186/s12859-022-04667-1
dc.identifier.uri	http://hdl.handle.net/1993/36444
dc.language.rfc3066	en
dc.rights	open access	en_US
dc.rights.holder	The Author(s)
dc.title	Deep clustering of small molecules at large-scale via variational autoencoder embedding and K-means
dc.type	Journal Article
local.author.affiliation	Rady Faculty of Health Sciences	en_US
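
The abstract states that the number of clusters was treated as a hyper-parameter and chosen by the Silhouette method before K-means grouping of the embeddings. The following is a minimal sketch of that selection step only, not the authors' implementation. It assumes synthetic 2-D points in place of the 32-dimensional autoencoder embeddings, omits the PCA and autoencoder stages entirely, and uses a deterministic farthest-first seeding for K-means as a simplification. All function and variable names (farthest_first, kmeans, silhouette, choose_k, embeddings) are hypothetical.

```python
import math
import random


def farthest_first(points, k):
    """Deterministic seeding (an assumption of this sketch): start at
    points[0], then repeatedly add the point farthest from chosen centers."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    return centers


def kmeans(points, k, n_iter=20):
    """Plain Lloyd's K-means; returns one cluster label per point."""
    centers = farthest_first(points, k)
    labels = []
    for _ in range(n_iter):
        # Assignment step: nearest center by Euclidean distance.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return labels


def silhouette(points, labels):
    """Mean silhouette coefficient over all points (higher is better)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two non-empty clusters")
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


def choose_k(points, candidates):
    """Score each candidate k and return the best one with all scores."""
    scores = {k: silhouette(points, kmeans(points, k)) for k in candidates}
    return max(scores, key=scores.get), scores


# Toy stand-in for molecule embeddings: three well-separated groups.
rng = random.Random(0)
blob_centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
embeddings = [
    (cx + rng.uniform(-1, 1), cy + rng.uniform(-1, 1))
    for cx, cy in blob_centers
    for _ in range(30)
]
best_k, scores = choose_k(embeddings, range(2, 7))
print(best_k, {k: round(s, 3) for k, s in scores.items()})
```

On this toy data the silhouette score should peak at the three planted groups. For a library of tens of thousands of molecules, the quadratic cost of the silhouette computation would likely call for an optimized library or subsampling rather than this pure-Python loop.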

Files

Original bundle

Name: 12859_2022_Article_4667.pdf
Size: 5.56 MB
Format: Adobe Portable Document Format

Collections

University of Manitoba Scholarship
Rady Faculty of Health Sciences Scholarly Works