Abstract: Large contrastive learning models, e.g., Sentence-T5, have recently been proposed to learn more powerful sentence embeddings. Though effective, such large models are hard to serve online due to computational resource and latency constraints. Knowledge distillation can compress a large ``teacher'' model into a small ``student'' model, but it generally suffers from performance degradation. To tackle this, we propose an effective knowledge distillation framework for contrastive sentence embeddings, termed DistilCSE. It first uses knowledge distillation to transfer the capability of a large contrastive learning model to a small student model on a large amount of unlabeled data, and then finetunes the student model with contrastive learning on limited labeled data. We further propose Contrastive Knowledge Distillation (CKD) to enhance the consistency of training objectives among teacher model training, knowledge distillation, and student model finetuning, which improves performance in a manner similar to prompt learning. Extensive experiments on seven semantic textual similarity benchmarks show that student models trained with the proposed DistilCSE and CKD suffer little or no performance degradation and consistently outperform counterparts of the same parameter size. Remarkably, our 110M student model can even outperform the latest state-of-the-art (SOTA) model, i.e., Sentence-T5 (11B), with only 1% of its parameters.
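To make the described objective concrete, below is a minimal sketch of what a contrastive distillation loss between student and teacher sentence embeddings might look like, assuming an InfoNCE-style formulation where each student embedding is pulled toward its own teacher embedding and pushed away from other teacher embeddings in the batch. The function name, temperature value, and the assumption that teacher and student embedding dimensions match (or have already been projected to match) are illustrative, not the paper's exact CKD definition.

```python
import torch
import torch.nn.functional as F

def contrastive_distillation_loss(student_emb: torch.Tensor,
                                  teacher_emb: torch.Tensor,
                                  temperature: float = 0.05) -> torch.Tensor:
    """Hypothetical InfoNCE-style distillation objective.

    Treats the teacher embedding of the same sentence as the positive and
    the teacher embeddings of other in-batch sentences as negatives.
    Assumes student_emb and teacher_emb are both (batch, dim); a linear
    projection may be needed if the two models use different hidden sizes.
    """
    s = F.normalize(student_emb, dim=-1)
    t = F.normalize(teacher_emb, dim=-1)
    logits = s @ t.T / temperature                    # cosine similarities
    targets = torch.arange(s.size(0), device=s.device)  # positives on the diagonal
    return F.cross_entropy(logits, targets)

# Usage sketch: embeddings from a large frozen teacher and a small trainable student
student_emb = torch.randn(32, 768, requires_grad=True)
teacher_emb = torch.randn(32, 768)
loss = contrastive_distillation_loss(student_emb, teacher_emb)
loss.backward()
```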
Paper Type: long