Distilling knowledge through student-teacher model and BERT for sentiment analysis

Date
2022-12-04
Authors
Dong, Ximing
Abstract

Bidirectional Encoder Representations from Transformers (BERT) is a state-of-the-art pre-trained deep learning model for natural language processing (NLP) tasks such as sentiment analysis. The BERT model dynamically generates word representations according to context and semantics using its bidirectional architecture and attention mechanism. Although the model improves precision on NLP tasks, it is compute-intensive and time-consuming to deploy on mobile or smaller platforms. In this thesis, to address this issue, we use knowledge distillation (KD), a "teacher-student" training technique, to compress the model. We use the BERT model as the "teacher" to transfer knowledge to student models: "first-generation" convolutional neural networks and long short-term memory networks with an attention mechanism (LSTM-atten). We conduct various experiments on sentiment analysis benchmark data sets and show that the student models trained through knowledge distillation achieve a 70% improvement in accuracy, precision, recall, and F1-score compared to the same models trained without KD. We also investigate the convergence rate of the student models and compare the results to existing models in the literature. Finally, we show that compared to the full-size BERT model, our RNN-series models are 50 times smaller in size and retain approximately 96% of its performance on benchmark data sets.
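The "teacher-student" distillation objective mentioned above is typically a weighted combination of a soft-target term (matching the teacher's temperature-softened output distribution) and a hard-target term (ordinary cross-entropy with the gold label). The following is a minimal NumPy sketch of that combined loss; the function names, the temperature `T=2.0`, and the weighting `alpha=0.5` are illustrative assumptions, not the exact settings used in the thesis.

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled softmax; a higher T produces a softer
    # distribution that exposes more of the teacher's "dark knowledge".
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, true_label,
                      T=2.0, alpha=0.5):
    # Soft-target term: cross-entropy between the teacher's and the
    # student's temperature-T distributions, scaled by T^2 so gradient
    # magnitudes stay comparable across temperatures.
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    soft = -np.sum(p_teacher * np.log(p_student + 1e-12)) * T * T
    # Hard-target term: standard cross-entropy with the gold label.
    p_hard = softmax(student_logits)
    hard = -np.log(p_hard[true_label] + 1e-12)
    return alpha * soft + (1.0 - alpha) * hard
```

A student whose logits track the teacher's (and the gold label) incurs a lower loss than one that disagrees, which is what drives the knowledge transfer during training.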

Keywords
Natural Language Processing