Sparse KD: Knowledge Distillation for Sparse Models in Constrained Fine-tuning Scenarios

Anonymous

16 Feb 2024 | ACL ARR 2024 February Blind Submission | Readers: Everyone
Abstract: Large language models (LLMs) have achieved tremendous success across domains, but their massive parameter counts pose challenges for fine-tuning and inference. A common model compression pipeline now prunes an LLM into a sparse model and then fine-tunes it with LoRA; however, this often incurs significant performance degradation. We attempted to address this by adding conventional teacher distillation, but found only limited improvement due to the gap between the teacher and the sparse student and the constrained number of training iterations. To overcome these challenges, we propose Sparse KD, the first distillation framework specifically designed for sparse models in constrained fine-tuning scenarios. The framework combines a dynamic temperature, a knowledge alignment module, and Bayesian optimization of the distillation settings. The dynamic temperature adaptively scales the strength of the teacher's knowledge, the knowledge alignment module bridges the teacher-student gap by projecting their knowledge onto the same interval, and Bayesian optimization quickly finds effective configurations of these strategies, thereby improving model performance. Comprehensive experiments across diverse task types demonstrate that this combination can be applied to LLMs with effective and stable results.
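To make the two core ideas named in the abstract concrete, here is a minimal sketch (not the authors' released code) of a distillation loss with a dynamic temperature and a knowledge-alignment step that maps teacher and student logits onto the same interval before comparison. The specific choices below, an entropy-based temperature schedule and per-sample standardization, are illustrative assumptions rather than the paper's exact formulation.

```python
# Illustrative sketch only: the concrete temperature rule and alignment mapping
# are assumptions; the paper's Bayesian optimization step would tune such choices.
import torch
import torch.nn.functional as F

def align(logits: torch.Tensor) -> torch.Tensor:
    # Project logits to a common interval via per-sample standardization,
    # so teacher and student outputs live on a comparable scale.
    return (logits - logits.mean(dim=-1, keepdim=True)) / (logits.std(dim=-1, keepdim=True) + 1e-6)

def dynamic_temperature(teacher_logits: torch.Tensor,
                        t_min: float = 1.0, t_max: float = 4.0) -> torch.Tensor:
    # Scale the temperature with the teacher's predictive entropy: confident
    # (sharp) teacher outputs get a lower temperature, flat ones a higher one.
    probs = F.softmax(teacher_logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1, keepdim=True)
    max_entropy = torch.log(torch.tensor(float(teacher_logits.size(-1))))
    return t_min + (t_max - t_min) * (entropy / max_entropy)

def sparse_kd_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor) -> torch.Tensor:
    # KL(teacher || student) on aligned logits with a per-sample temperature.
    s, t = align(student_logits), align(teacher_logits)
    tau = dynamic_temperature(teacher_logits)          # shape: (batch, 1)
    log_p_s = F.log_softmax(s / tau, dim=-1)
    p_t = F.softmax(t / tau, dim=-1)
    kl = (p_t * (p_t.clamp_min(1e-9).log() - log_p_s)).sum(dim=-1)
    return (kl * tau.squeeze(-1) ** 2).mean()

# Example: a batch of 8 samples over a vocabulary of 32 tokens.
student = torch.randn(8, 32)
teacher = torch.randn(8, 32)
print(sparse_kd_loss(student, teacher))
```

In the proposed framework, hyperparameters such as the temperature range and the form of the alignment mapping would presumably be selected by the Bayesian optimization strategy rather than fixed by hand as in this sketch.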
Paper Type: long
Research Area: Efficient/Low-Resource Methods for NLP
Contribution Types: Approaches to low-resource settings, Approaches low compute settings-efficiency
Languages Studied: English