Deep clustering of small molecules at large-scale via variational autoencoder embedding and K-means

dc.contributor.author: Hadipour, Hamid
dc.contributor.author: Liu, Chengyou
dc.contributor.author: Davis, Rebecca
dc.contributor.author: Cardona, Silvia T.
dc.contributor.author: Hu, Pingzhao
dc.date.accessioned: 2022-05-01T03:20:35Z
dc.date.issued: 2022-04-15
dc.date.updated: 2022-05-01T03:20:35Z
dc.description.abstract:
Background: Converting molecules into computer-interpretable features with rich molecular information is a core problem of data-driven machine learning applications in chemical and drug-related tasks. Generally speaking, there are global and local features to represent a given molecule. As most algorithms have been developed based on one type of feature, a remaining bottleneck is to combine both feature sets for advanced molecule-based machine learning analysis. Here, we explored a novel analytical framework to make embeddings of the molecular features and apply them in the clustering of a large number of small molecules.
Results: In this novel framework, we first introduced a principal component analysis method encoding the molecule-specific atom and bond information. We then used a variational autoencoder (AE)-based method to make embeddings of the global chemical properties and the local atom and bond features. Next, using the embeddings from the encoded local and global features, we implemented and compared several unsupervised clustering algorithms to group the molecule-specific embeddings. The number of clusters was treated as a hyper-parameter and determined by the Silhouette method. Finally, we evaluated the corresponding results using three internal indices. Applying the analysis framework to a large chemical library of more than 47,000 molecules, we successfully identified 50 molecular clusters using the K-means method with 32 embeddings based on the AE method. We visualized the clustering result via t-SNE for the overall distribution of molecules and the similarity maps for the structural analysis of randomly selected cluster-specific molecules.
Conclusions: This study developed a novel analytical framework that comprises a feature engineering scheme for molecule-specific atomic and bonding features and a deep learning-based embedding strategy for different molecular features. By applying the identified embeddings, we show their usefulness for clustering a large molecule dataset. Our novel analytic algorithms can be applied to any virtual library of chemical compounds with diverse molecular structures. Hence, these tools have the potential of optimizing drug discovery, as they can decrease the number of compounds to be screened in any drug screening campaign.
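The abstract states that the number of clusters was treated as a hyper-parameter and chosen by the Silhouette method before K-means was applied to the embeddings. The record does not give the authors' implementation, so the following is only a minimal sketch of that selection step under stated assumptions: toy 2-D points stand in for the 32-dimensional embeddings, the farthest-first initialization is a simplification chosen to keep the run deterministic, and the names `kmeans`, `silhouette`, and `choose_k` are hypothetical.

```python
import math
import random


def kmeans(points, k, iters=100):
    """Lloyd's K-means with farthest-first initialization (an assumption,
    not the authors' setup). Returns one cluster label per point."""
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from all centers chosen so far.
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = None
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        new_labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        if new_labels == labels:
            break  # converged
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return labels


def silhouette(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1 or len(clusters) < 2:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != lab)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


def choose_k(points, candidates):
    """Return the candidate k with the highest mean silhouette, plus all scores."""
    scores = {k: silhouette(points, kmeans(points, k)) for k in candidates}
    return max(scores, key=scores.get), scores


# Toy data (hypothetical): three well-separated 2-D blobs of 20 points each.
rng = random.Random(0)
blob_centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [(cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5))
          for cx, cy in blob_centers for _ in range(20)]

best_k, scores = choose_k(points, range(2, 7))
print(best_k, {k: round(s, 3) for k, s in scores.items()})
```

On a library of tens of thousands of molecules, the pairwise distances in `silhouette` would be too slow in pure Python. A vectorized library implementation would be the practical choice, with the selection logic unchanged.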
dc.identifier.citation: BMC Bioinformatics. 2022 Apr 15;23(Suppl 4):132
dc.identifier.uri: https://doi.org/10.1186/s12859-022-04667-1
dc.identifier.uri: http://hdl.handle.net/1993/36444
dc.language.rfc3066: en
dc.rights: open access (en_US)
dc.rights.holder: The Author(s)
dc.title: Deep clustering of small molecules at large-scale via variational autoencoder embedding and K-means
dc.type: Journal Article
local.author.affiliation: Rady Faculty of Health Sciences (en_US)
Files
Original bundle
Name: 12859_2022_Article_4667.pdf
Size: 5.56 MB
Format: Adobe Portable Document Format
License bundle
Name: license.txt
Size: 2.24 KB
Format: Item-specific license agreed to upon submission