A framework for the indexing, querying, clustering, and visualization of microbial genomes for surveillance and outbreak investigation

Date
2022-08-25
Authors
Petkau, Aaron
Abstract
Whole-genome sequencing (WGS) has increasingly become a routine part of monitoring infectious diseases. The genomes of bacteria, viruses, or other infectious agents are sequenced and used to identify nucleotide variants or other genetic differences, providing a wealth of detailed information. This has become particularly relevant with the COVID-19 pandemic, where sequencing millions of viral genomes has been essential for the early identification of new viral lineages. The continuous generation of WGS data at this scale has introduced a number of challenges for efficiently generating timely reports and searching for epidemiologically significant patterns. I have designed and implemented a framework to address these problems, the Genomics Data Index (https://github.com/apetkau/genomics-data-index), which uses ideas from the field of information retrieval to transform WGS data into a collection of genomic features (nucleotide variants, kmers, and genes) and index these features for rapid querying. I provide a command-line interface and Python API for incrementally adding new data and querying the index. The query API integrates with existing methods for working with tabular and phylogenetic data to provide a common interface for clustering, visualization, and statistical analysis of microbial genomes. I evaluated this framework using three datasets containing assembled genomes and sequence reads. Indexing assemblies was more sensitive for nucleotide variant detection when there were fewer variants (sensitivity = 0.948 at 6.77% divergence, compared to 0.663 for reads), but the sensitivity of indexing reads surpassed that of assemblies as the number of variants increased. The software scaled to tens of thousands of SARS-CoV-2 genomes (2.17 hours to load 20,000 genomes) and constructed phylogenies consistent with the existing Pangolin lineage system. Phylogenies constructed from nucleotide variants derived from bacterial WGS reads consistently grouped outbreak-related bacteria into monophyletic clades (4/4 correct clades), while kmer clustering grouped bacteria only at the species level. I have already applied this software to aid in investigating new lineages of SARS-CoV-2, and I believe this software will be of great benefit for future research and genomic surveillance of other infectious diseases.
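The information-retrieval idea mentioned in the abstract, indexing genomic features so that samples can be looked up by the features they carry, can be illustrated with a toy inverted index. This is only a minimal sketch under stated assumptions: the `FeatureIndex` class, its method names, and the `ref:position:change` feature strings are hypothetical and are not the Genomics Data Index API or its storage format.

```python
from collections import defaultdict


class FeatureIndex:
    """Toy inverted index mapping each feature to the samples that carry it.

    Hypothetical illustration only; not the Genomics Data Index API.
    """

    def __init__(self):
        self._samples_by_feature = defaultdict(set)
        self._samples = set()

    def add_sample(self, sample, features):
        """Incrementally add one sample and its features to the index."""
        self._samples.add(sample)
        for feature in features:
            self._samples_by_feature[feature].add(sample)

    def has(self, feature):
        """Return a copy of the set of samples carrying the feature."""
        return set(self._samples_by_feature.get(feature, set()))

    def has_all(self, features):
        """Return the samples carrying every one of the given features."""
        result = set(self._samples)
        for feature in features:
            result &= self.has(feature)
        return result


# Feature strings below are made-up nucleotide variants (assumed format).
index = FeatureIndex()
index.add_sample("sampleA", ["ref:100:A>T", "ref:250:G>C"])
index.add_sample("sampleB", ["ref:100:A>T"])
index.add_sample("sampleC", ["ref:250:G>C", "ref:300:C>T"])  # added later

with_variant = index.has("ref:100:A>T")
with_both = index.has_all(["ref:100:A>T", "ref:250:G>C"])
```

In this sketch a query is a set lookup followed by set intersections, so the cost of a query depends on the number of samples matching each feature rather than on the total amount of sequence data indexed.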
Keywords
microbial genomics, information retrieval, clustering, visualization, epidemiology, indexing, querying, genomic features, whole-genome sequencing