Classifying SARS-CoV-2 and common respiratory viruses from genome assemblies

Rahman, Mohaimen
Polymerase chain reaction (PCR) testing is widely used for the systematic identification of Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) strains. An alternative approach is to identify the virus by machine learning classification of genome sequences, which has shown promising results. While trained clinicians usually classify genome sequences, a machine learning classifier can complement that process and provide a short list for further analysis. A machine learning approach can provide a unique fingerprint of base pairs and yield a quick classification. To this end, we investigated a k-mer approach to classify genome sequences of SARS-CoV-2 and common respiratory viruses, as well as a human genome sequence. We aim to provide a simplified classification approach that balances validation time with limited hyperparameter tuning. Our approach achieved F1 scores above 0.99, with perfect scores when distinguishing among the common respiratory viruses. We demonstrated a simple 5-base sub-sequencing scheme that can differentiate more than 7.91 million sequences from almost 20 thousand genome assemblies.
virus DNA classification, classification of SARS-CoV-2 and other respiratory viruses
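
The k-mer fingerprint mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify how sub-sequences are extracted or normalized, so the choices here are assumptions for illustration only: overlapping windows of 5 bases, an A/C/G/T alphabet, windows with ambiguous symbols skipped, and counts normalized to frequencies. The names `kmer_profile`, `KMERS`, and `INDEX` are hypothetical and not taken from the paper.

```python
from itertools import product

K = 5  # sub-sequence length mentioned in the abstract
ALPHABET = "ACGT"  # assumption: only unambiguous bases are counted

# All 4**K possible k-mers in a fixed order, used as feature positions.
KMERS = ["".join(p) for p in product(ALPHABET, repeat=K)]
INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def kmer_profile(sequence):
    """Return the relative frequency of each 5-mer in `sequence`.

    Windows overlap (stride 1). Windows containing any symbol outside
    A/C/G/T, such as N, are skipped. Returns a list of 4**K floats.
    """
    seq = sequence.upper()
    counts = [0] * len(KMERS)
    total = 0
    for i in range(len(seq) - K + 1):
        idx = INDEX.get(seq[i:i + K])
        if idx is not None:
            counts[idx] += 1
            total += 1
    if total == 0:
        # Sequence too short or fully ambiguous: return an all-zero profile.
        return [0.0] * len(counts)
    return [c / total for c in counts]
```

Under these assumptions each sequence maps to a fixed-length vector of 1024 frequencies, which could then be passed to a standard classifier; the abstract does not say which classifier the authors used.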