Classifying SARS-CoV-2 and common respiratory viruses from genome assemblies
Abstract
Polymerase chain reaction (PCR) testing is widely used for the systematic identification of Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) strains. An alternative approach, which has shown promising results, is to identify the SARS-CoV-2 virus by machine learning classification of genome sequences. While trained clinicians usually perform the classification of genome sequences, a machine learning classifier can complement that process and provide a short list for further analysis. A machine learning approach can provide a unique fingerprint of base pairs and yield a quick classification. To this end, we investigated a k-mer approach to classify genome sequences of SARS-CoV-2 and common respiratory viruses, as well as a human genome sequence. We aim to provide a simplified classification approach that keeps validation time reasonable while limiting hyperparameter tuning. Our approach achieved F1 scores in excess of 0.99, and perfect scores when distinguishing among the common respiratory viruses. We demonstrated that a simple 5-base sub-sequencing scheme has the power to differentiate over 7.91 million sequences from almost 20 thousand genome assemblies.
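To illustrate what a 5-base k-mer fingerprint might look like in practice, the following is a minimal Python sketch, not the authors' implementation. It slides a window of length 5 over a sequence and returns the normalized frequency of each of the 4^5 = 1024 possible 5-mers. The function names `kmer_index` and `kmer_profile`, the normalization, and the choice to skip windows containing ambiguous bases such as N are all assumptions made for illustration; the abstract does not specify these details.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def kmer_index(k=5):
    """Map every possible k-mer over ACGT to a fixed vector position.

    Hypothetical helper: for k=5 this yields 4**5 = 1024 positions.
    """
    return {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=k))}


def kmer_profile(sequence, k=5):
    """Return a normalized k-mer frequency vector for one sequence.

    Assumption: windows containing characters outside ACGT (for
    example the ambiguity code N) are skipped rather than imputed.
    """
    index = kmer_index(k)
    seq = sequence.upper()
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

    vector = [0.0] * len(index)
    total = 0
    for kmer, n in counts.items():
        position = index.get(kmer)
        if position is not None:
            vector[position] = float(n)
            total += n

    # Normalize so sequences of different lengths are comparable.
    if total:
        vector = [v / total for v in vector]
    return vector


if __name__ == "__main__":
    profile = kmer_profile("ACGTACGTAC")
    print(len(profile), sum(1 for v in profile if v > 0))
```

A vector of this kind could then be passed to any standard classifier; which classifier the paper uses is not stated in the abstract, so none is assumed here.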