Fine-tuning an Automatic Speech Recognition model for a Canadian Indigenous counselling program

dc.contributor.author: Olaniyanu, Emmanuel
dc.contributor.examiningcommittee: Yahampath, Pradeepa (Electrical and Computer Engineering)
dc.contributor.examiningcommittee: Lithgow, Brian (Electrical and Computer Engineering)
dc.contributor.examiningcommittee: Marzban, Hassan (Human Anatomy and Cell Science)
dc.contributor.supervisor: Moussavi, Zahra
dc.date.accessioned: 2025-01-06T17:17:19Z
dc.date.available: 2025-01-06T17:17:19Z
dc.date.issued: 2025-01-03
dc.date.submitted: 2025-01-03T19:29:31Z
dc.degree.discipline: Biomedical Engineering
dc.degree.level: Master of Science (M.Sc.)
dc.description.abstract: Automatic Speech Recognition (ASR) systems have seen a marked improvement in performance since adopting the End-to-End (E2E) approach. However, ASR performance still depends largely on the quantity and quality of the training data. Popular ASR systems today are trained on thousands of hours of speech, yet they consistently degrade when exposed to outlier accents, vocal pitches, and speech from underrepresented demographics. The biggest remaining hurdle is the scarcity of niche speech data: most available speech data falls into a few voice types and is not representative of the average ASR system user. Although most machine learning projects require large amounts of data, the adaptability of E2E ASR models allows them to be fine-tuned to outlier speech using small amounts of representative data. The project presented in this thesis aimed to fine-tune an ASR model for use in an Indigenous Counselling Program. The open-source ASR system Mozilla DeepSpeech was used to train the project's models. Representative speech data was gathered from audiobooks and organized into speech corpora to fine-tune DeepSpeech's pre-trained models. DeepSpeech's live transcription software was also implemented in the Unity Development Engine to ensure compatibility with the Indigenous Counselling Program. Three models were trained on the new speech data: two on pitch frequency-specific data, and one general model on all of the new speech data. The results showed a minimum average relative Word Error Rate (WER) improvement of 8.90% for all models on the datasets they were trained on. Furthermore, the general model showed little to no performance gain over the pitch-specific models when tested on those models' training datasets, demonstrating the importance of representative speech data in ASR model training.
Overall, the models showed a marked improvement in performance when trained on the intended user's accent and voice type.
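The abstract reports results as relative Word Error Rate (WER) improvements. The following sketch (not taken from the thesis; a standard illustration only) shows how WER and the relative improvement between a baseline and a fine-tuned model are typically computed:

```python
# Illustrative sketch: Word Error Rate (WER) and relative WER improvement,
# the evaluation metrics reported in the abstract. Not code from the thesis.

def wer(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Word-level Levenshtein edit distance via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

def relative_wer_improvement(baseline_wer: float, finetuned_wer: float) -> float:
    """Percentage reduction in WER relative to the baseline,
    e.g. 0.30 -> 0.27 is a 10% relative improvement."""
    return (baseline_wer - finetuned_wer) / baseline_wer * 100
```

A relative improvement is used (rather than an absolute WER difference) so that gains on speakers with very different baseline error rates remain comparable.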
dc.description.note: February 2025
dc.identifier.uri: http://hdl.handle.net/1993/38749
dc.language.iso: eng
dc.rights: open access
dc.subject: Automatic Speech Recognition
dc.subject: Accent
dc.subject: Indigenous Canadian Accent
dc.subject: Machine Learning
dc.subject: Deep Learning
dc.subject: Fine-Tuning
dc.subject: ASR
dc.subject: Data Science
dc.title: Fine-tuning an Automatic Speech Recognition model for a Canadian Indigenous counselling program
dc.type: master thesis
local.subject.manitoba: yes
Files
Original bundle (1 of 1)
Name: Olaniyanu_Emmanuel.pdf
Size: 1.2 MB
Format: Adobe Portable Document Format
License bundle (1 of 1)
Name: license.txt
Size: 770 B
Format: Item-specific license agreed to upon submission