Abstract: In this work, we present a novel approach to simultaneous knowledge transfer and model compression called \textbf{Weight Squeezing}. With this method, we perform knowledge transfer from a teacher model \textbf{by learning a mapping from its weights to smaller student model weights}. We applied Weight Squeezing to a pre-trained text classification model based on BERT-Medium and compared our method to various other knowledge transfer and model compression methods on the GLUE multitask benchmark. We observed that our approach produces better results while training student models significantly faster than other methods. We also proposed a variant of Weight Squeezing called Gated Weight Squeezing, which combines fine-tuning a small BERT model with learning a mapping from larger BERT weights. We showed that, in most cases, fine-tuning a BERT model with Gated Weight Squeezing outperforms plain fine-tuning.
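The core idea can be sketched in a few lines. The snippet below is an illustrative assumption, not the authors' implementation: it models the weight mapping as a pair of trainable projection matrices applied on either side of a teacher weight matrix, and the gated variant as a sigmoid-gated blend of a fine-tuned student weight with the mapped one. The hidden sizes (512 teacher, 256 student) are hypothetical placeholders.

```python
import numpy as np

# Hedged sketch of Weight Squeezing (illustrative, not the paper's code).
# A teacher weight matrix is mapped down to student size by two projection
# matrices, which would be trained jointly with the task loss.
rng = np.random.default_rng(0)
d_teacher, d_student = 512, 256  # assumed hidden sizes

W_teacher = rng.standard_normal((d_teacher, d_teacher))

# Trainable mapping parameters (here randomly initialized, normally learned).
M_out = rng.standard_normal((d_student, d_teacher)) / np.sqrt(d_teacher)
M_in = rng.standard_normal((d_teacher, d_student)) / np.sqrt(d_teacher)

# Student weight produced by the learned mapping from teacher weights.
W_mapped = M_out @ W_teacher @ M_in  # shape: (d_student, d_student)

# Gated Weight Squeezing sketch: blend a directly fine-tuned student weight
# with the mapped one through a learned gate g in [0, 1].
W_finetuned = rng.standard_normal((d_student, d_student))
g = 1.0 / (1.0 + np.exp(-0.3))  # sigmoid of a learned gate logit (assumed scalar)
W_student = g * W_finetuned + (1.0 - g) * W_mapped

print(W_student.shape)
```

In this reading, plain Weight Squeezing trains only the mapping matrices, while the gated variant lets the student interpolate between inheriting mapped teacher knowledge and its own fine-tuned weights.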