Fine-tuning an Automatic Speech Recognition model for a Canadian Indigenous counselling program

Date
2025-01-03
Authors
Olaniyanu, Emmanuel
Abstract

Automatic Speech Recognition (ASR) systems have seen a marked improvement in performance since adopting the End-to-End (E2E) approach. However, the performance of ASR models still depends largely on the quantity and quality of the data they are trained on. Popular ASR systems today are trained on thousands of hours of speech, yet they consistently fail to maintain their performance when exposed to outlier accents, vocal pitches, and speech from underrepresented demographics. The biggest remaining hurdle is the lack of niche speech data: most available speech data falls into a few voice types and is not representative of the average ASR system user. Although most machine learning projects require large amounts of data, the adaptability of E2E ASR models allows them to be fine-tuned to outlier speech using small amounts of representative data. The project presented in this thesis aimed to fine-tune an ASR model for use in an Indigenous Counselling Program. Mozilla DeepSpeech, an open-source ASR system, was used to train the models for the project. Representative speech data was gathered from audiobooks and organized into speech corpora to fine-tune DeepSpeech’s pre-trained models. DeepSpeech’s live transcription software was also implemented on the Unity Development Engine to ensure compatibility with the Indigenous Counselling Program. Three models were trained on the new speech data: two on pitch frequency-specific subsets and one general model on all of the new data. The results showed a minimum average relative Word Error Rate (WER) improvement of 8.90% for each model on the dataset it was trained on. Furthermore, the general model showed little to no improvement over the pitch-specific models when tested on the pitch-specific datasets, demonstrating the importance of representative speech data in ASR model training. Overall, the models showed a marked improvement in performance when trained on the intended users’ accent and voice type.
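
For reference, the sketch below shows how a relative WER improvement figure of the kind reported above can be computed. The function names and the example numbers are illustrative assumptions only; they are not the thesis's actual code or evaluation results.

# Minimal sketch: word error rate (WER) and relative WER improvement.
# Helper names and example figures are hypothetical, for illustration only.

def wer(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words,
    computed as word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words and first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

def relative_wer_improvement(baseline_wer: float, finetuned_wer: float) -> float:
    """Relative improvement of the fine-tuned model over the baseline, in percent."""
    return (baseline_wer - finetuned_wer) / baseline_wer * 100.0

# Hypothetical example: a baseline WER of 0.45 dropping to 0.41 after fine-tuning
# corresponds to a relative improvement of roughly 8.9%.
print(relative_wer_improvement(0.45, 0.41))  # ~8.89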

Keywords
Automatic Speech Recognition, Accent, Indigenous Canadian Accent, Machine Learning, Deep Learning, Fine-Tuning, ASR, Data Science