Spark-based data analytics of sequence motifs in large omics data

Sarumi, Oluwafemi; Leung, Carson; Adetunmbi, Adebayo

Date

2018

Authors

Sarumi, Oluwafemi

Leung, Carson

Adetunmbi, Adebayo

Publisher

Elsevier

Abstract

The recent data explosion in bioinformatics challenges researchers to develop novel techniques for discovering new knowledge from the avalanche of omics data (e.g., genomics, proteomics, transcriptomics). These data are embedded with a wealth of information, including frequently repeated patterns (i.e., sequence motifs). In genomics, deoxyribonucleic acid (DNA) sequence motifs are short, contiguous, frequently repeated subsequences located in the promoter region. Due to the high volume and varying degrees of veracity of the DNA datasets generated by next-generation sequencing techniques, sequence motif mining from DNA sequences poses a major challenge in bioinformatics. In this article, we present a distributed sequential algorithm for DNA sequence motif mining, which uses the MapReduce programming model on a cluster of homogeneous distributed-memory systems running the Apache Spark computing framework. Experimental results show the effectiveness of our algorithm in Spark-based data analytics of sequence motifs in large omics data.
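
The abstract does not spell out the algorithm itself, so the following is only a minimal sketch of the general MapReduce idea applied to motif counting, not the authors' method. It assumes that a motif candidate is any contiguous substring of fixed length k and that "frequent" means a total count of at least min_count. It runs in plain Python on one machine rather than on Spark. The names map_kmers, reduce_counts, frequent_motifs, k, and min_count are hypothetical and chosen for illustration.

```python
from collections import Counter
from functools import reduce


def map_kmers(sequence, k):
    """Map step: count every contiguous length-k substring of one sequence."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def reduce_counts(left, right):
    """Reduce step: merge two partial count tables by summing their counts."""
    return left + right


def frequent_motifs(sequences, k, min_count):
    """Return the k-mers whose total count across all sequences is >= min_count."""
    # Each sequence is mapped independently; on a cluster this is the part
    # that would be spread across worker nodes.
    partial_counts = [map_kmers(seq, k) for seq in sequences]
    # The partial tables are then combined pairwise into one global table.
    totals = reduce(reduce_counts, partial_counts, Counter())
    return {kmer: n for kmer, n in totals.items() if n >= min_count}


if __name__ == "__main__":
    reads = ["ACGTACGT", "TTACGA"]  # toy input, not real sequencing data
    print(frequent_motifs(reads, k=3, min_count=2))
```

On the toy input, the 3-mers ACG, CGT, and TAC each occur at least twice and are reported; every other 3-mer occurs once and is filtered out. In a Spark setting the same map and reduce roles would typically be played by transformations over a distributed collection, but that mapping is an assumption here and is not taken from the paper.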

Keywords

bioinformatics, Spark, MapReduce, deoxyribonucleic acid (DNA), genomics, sequence motifs

Citation

O.A. Sarumi, C.K. Leung, A.O. Adetunmbi. Spark-based data analytics of sequence motifs in large omics data. Procedia Computer Science, 126 (2018), pp. 596-605

URI

http://hdl.handle.net/1993/33656

Collections

University of Manitoba Scholarship
Faculty of Science Scholarly Works
