Fine-tuning an Automatic Speech Recognition model for a Canadian Indigenous counselling program

dc.contributor.author: Olaniyanu, Emmanuel
dc.contributor.examiningcommittee: Yahampath, Pradeepa (Electrical and Computer Engineering)
dc.contributor.examiningcommittee: Lithgow, Brian (Electrical and Computer Engineering)
dc.contributor.examiningcommittee: Marzban, Hassan (Human Anatomy and Cell Science)
dc.contributor.supervisor: Moussavi, Zahra
dc.date.accessioned: 2025-01-06T17:17:19Z
dc.date.available: 2025-01-06T17:17:19Z
dc.date.issued: 2025-01-03
dc.date.submitted: 2025-01-03T19:29:31Z
dc.degree.discipline: Biomedical Engineering
dc.degree.level: Master of Science (M.Sc.)
dc.description.abstract: Automatic Speech Recognition (ASR) systems have seen a marked improvement in performance since adopting the End-to-End (E2E) approach. However, ASR performance still depends largely on the quantity and quality of the training data. Popular ASR systems today are trained on thousands of hours of speech, yet they consistently degrade when exposed to outlier accents, vocal pitches, and speech from underrepresented demographics. The biggest remaining hurdle is the scarcity of niche speech data: most available speech data falls into a few voice types and is not representative of the average ASR system user. Although most machine learning projects require large amounts of data, the adaptability of E2E ASR models allows them to be fine-tuned to outlier speech using small amounts of representative data. The project presented in this thesis aimed to fine-tune an ASR model for use in an Indigenous Counselling Program. The open-source ASR system Mozilla DeepSpeech was used to train the project's models. Representative speech data was gathered from audiobooks and organized into speech corpora to fine-tune DeepSpeech's pre-trained models. DeepSpeech's live transcription software was also implemented in the Unity Development Engine to ensure compatibility with the Indigenous Counselling Program. Three models were trained on the new speech data: two on pitch frequency-specific data, and one general model on all of the new speech data. The results showed a minimum average relative Word Error Rate (WER) improvement of 8.90% for all models on the datasets they were trained on. Furthermore, the general model showed little to no performance gain over the pitch-specific models when tested on those models' training datasets, demonstrating the importance of representative speech data in ASR model training.
Overall, the models showed a marked improvement in performance when trained on the intended user's accent and voice type.
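The abstract reports results as relative Word Error Rate (WER) improvements. The following sketch (not taken from the thesis; a standard illustration only) shows how WER and the relative improvement between a baseline and a fine-tuned model are typically computed:

```python
# Illustrative sketch: Word Error Rate (WER) and relative WER improvement,
# the evaluation metrics reported in the abstract. Not code from the thesis.

def wer(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Word-level Levenshtein edit distance via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

def relative_wer_improvement(baseline_wer: float, finetuned_wer: float) -> float:
    """Percentage reduction in WER relative to the baseline,
    e.g. 0.30 -> 0.27 is a 10% relative improvement."""
    return (baseline_wer - finetuned_wer) / baseline_wer * 100
```

A relative improvement is used (rather than an absolute WER difference) so that gains on speakers with very different baseline error rates remain comparable.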
dc.description.note: February 2025
dc.identifier.uri: http://hdl.handle.net/1993/38749
dc.language.iso: eng
dc.rights: open access
dc.subject: Automatic Speech Recognition
dc.subject: Accent
dc.subject: Indigenous Canadian Accent
dc.subject: Machine Learning
dc.subject: Deep Learning
dc.subject: Fine-Tuning
dc.subject: ASR
dc.subject: Data Science
dc.title: Fine-tuning an Automatic Speech Recognition model for a Canadian Indigenous counselling program
dc.type: master thesis
local.subject.manitoba: yes
Files
Original bundle (1 of 1)
Name: Olaniyanu_Emmanuel.pdf
Size: 1.2 MB
Format: Adobe Portable Document Format
License bundle (1 of 1)
Name: license.txt
Size: 770 B
Format: Item-specific license agreed to upon submission