Sparse KD: Knowledge Distillation for Sparse Models in Constrained Fine-tuning Scenarios

Anonymous

16 Feb 2024 | ACL ARR 2024 February Blind Submission | Readers: Everyone
Abstract: Large language models (LLMs) have achieved tremendous success across domains, but their massive parameter counts pose challenges for fine-tuning and inference. A common model compression pipeline now prunes an LLM into a sparse model and then fine-tunes it with LoRA; however, this often incurs significant performance degradation. We attempted to address this by adding conventional teacher distillation, but found only limited improvement due to the gap between the teacher and the sparse student and the constrained number of training iterations. To overcome these challenges, we propose Sparse KD, the first distillation framework specifically designed for sparse models in constrained fine-tuning scenarios. The framework combines a dynamic temperature, a knowledge alignment module, and Bayesian optimization of the distillation settings. The dynamic temperature adaptively scales the strength of the teacher's knowledge, the knowledge alignment module bridges the teacher-student gap by projecting their knowledge onto the same interval, and Bayesian optimization quickly finds effective configurations of these strategies, thereby improving model performance. Comprehensive experiments across diverse task types demonstrate that this combination can be applied to LLMs with effective and stable results.
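To make the two core ideas named in the abstract concrete, here is a minimal sketch (not the authors' released code) of a distillation loss with a dynamic temperature and a knowledge-alignment step that maps teacher and student logits onto the same interval before comparison. The specific choices below, an entropy-based temperature schedule and per-sample standardization, are illustrative assumptions rather than the paper's exact formulation.

```python
# Illustrative sketch only: the concrete temperature rule and alignment mapping
# are assumptions; the paper's Bayesian optimization step would tune such choices.
import torch
import torch.nn.functional as F

def align(logits: torch.Tensor) -> torch.Tensor:
    # Project logits to a common interval via per-sample standardization,
    # so teacher and student outputs live on a comparable scale.
    return (logits - logits.mean(dim=-1, keepdim=True)) / (logits.std(dim=-1, keepdim=True) + 1e-6)

def dynamic_temperature(teacher_logits: torch.Tensor,
                        t_min: float = 1.0, t_max: float = 4.0) -> torch.Tensor:
    # Scale the temperature with the teacher's predictive entropy: confident
    # (sharp) teacher outputs get a lower temperature, flat ones a higher one.
    probs = F.softmax(teacher_logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1, keepdim=True)
    max_entropy = torch.log(torch.tensor(float(teacher_logits.size(-1))))
    return t_min + (t_max - t_min) * (entropy / max_entropy)

def sparse_kd_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor) -> torch.Tensor:
    # KL(teacher || student) on aligned logits with a per-sample temperature.
    s, t = align(student_logits), align(teacher_logits)
    tau = dynamic_temperature(teacher_logits)          # shape: (batch, 1)
    log_p_s = F.log_softmax(s / tau, dim=-1)
    p_t = F.softmax(t / tau, dim=-1)
    kl = (p_t * (p_t.clamp_min(1e-9).log() - log_p_s)).sum(dim=-1)
    return (kl * tau.squeeze(-1) ** 2).mean()

# Example: a batch of 8 samples over a vocabulary of 32 tokens.
student = torch.randn(8, 32)
teacher = torch.randn(8, 32)
print(sparse_kd_loss(student, teacher))
```

In the proposed framework, hyperparameters such as the temperature range and the form of the alignment mapping would presumably be selected by the Bayesian optimization strategy rather than fixed by hand as in this sketch.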
Paper Type: long
Research Area: Efficient/Low-Resource Methods for NLP
Contribution Types: Approaches to low-resource settings, Approaches low compute settings-efficiency
Languages Studied: English