Dynamic Self-Distillation via Previous Mini-batches for Fine-tuning Small Language Models

Yao Fu; Yu Yin; Runchao Li; Xianxuan Long; Haotian Yu; Pan Li

Date: 26 Sept 2024 (modified: 19 Nov 2024)

Venue: ICLR 2025 Conference Withdrawn Submission

License: CC BY 4.0

Keywords: Knowledge Distillation, Self-Distillation, Pre-trained Language Models, Large Language Models, Small Language Models, NLP, Fine-tuning

TL;DR: A model-agnostic and task-agnostic self-distillation method that uses information from the previous mini-batch.

Abstract: Knowledge Distillation (KD) has become a widely adopted approach for compressing large language models (LLMs) to reduce computational cost and memory footprint. However, most KD pipelines require access to a complex teacher model, so traditional KD can be infeasible or prohibitively expensive, particularly when it relies on commercial LLMs such as GPT-4. Self-Distillation (SelfD) is an appealing alternative that lets student models learn without a teacher's guidance. Nonetheless, existing SelfD approaches for LMs often require architectural modifications and assume that the models are open-source, which is not always practical. In this work, we introduce a model-agnostic and task-agnostic method named dynamic SelfD from the previous mini-batch (DynSDPB), in which the current iteration distills from the logits generated in the previous iteration. Additionally, to address inaccurate predictions during early iterations, we dynamically adjust the distillation influence and temperature values to enhance the adaptability of fine-tuning. Furthermore, we propose Vocabulary Map Matching (VMM) to address output inconsistency for auto-regressive LLMs. Finally, DynSDPB can be integrated seamlessly with existing self-correction and self-training techniques for small language models (SLMs). We apply DynSDPB to both encoder-only LMs (e.g., the BERT model family) and decoder-only LMs (e.g., the LLaMA model family), validating its effectiveness on natural language understanding (NLU) and natural language generation (NLG) benchmarks.
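The core idea in the abstract, distilling the current iteration from logits produced in the previous one with a dynamically adjusted weight and temperature, can be illustrated with a minimal sketch. The abstract does not give the loss or the adjustment rule, so everything specific here is an assumption made for illustration: the function names, the entropy-based rule in `dynamic_alpha_tau`, the constants `alpha_max`, `tau_base`, and `tau_extra`, and the premise that the stored logits belong to the same example as the current ones. It is not the authors' implementation.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_norm for s in scaled]


def normalized_entropy(logits):
    """Entropy of softmax(logits) divided by log(num_classes), clamped to [0, 1]."""
    log_p = log_softmax(logits)
    h = -sum(math.exp(lp) * lp for lp in log_p)
    return min(1.0, max(0.0, h / math.log(len(logits))))


def dynamic_alpha_tau(prev_logits, alpha_max=0.5, tau_base=1.0, tau_extra=2.0):
    """Hypothetical rule (an assumption, not from the paper): when the previous
    prediction is uncertain, trust it less (smaller alpha) and soften it more
    (larger temperature)."""
    u = normalized_entropy(prev_logits)
    alpha = alpha_max * (1.0 - u)
    tau = tau_base + tau_extra * u
    return alpha, tau


def self_distill_loss(cur_logits, prev_logits, label):
    """Cross-entropy on the label plus KL(previous || current) on softened outputs.

    prev_logits are treated as a fixed target: the logits the same model
    produced for this example in the previous iteration (an assumption).
    """
    alpha, tau = dynamic_alpha_tau(prev_logits)
    ce = -log_softmax(cur_logits)[label]
    log_t = log_softmax(prev_logits, tau)  # "teacher": previous iteration
    log_s = log_softmax(cur_logits, tau)   # "student": current iteration
    kl = sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_t, log_s))
    # The tau**2 factor is the usual KD scaling that keeps gradient magnitudes
    # comparable across temperatures.
    return (1.0 - alpha) * ce + alpha * tau ** 2 * kl


if __name__ == "__main__":
    current = [2.0, 0.5, -1.0]
    previous = [1.5, 0.8, -0.5]
    print(dynamic_alpha_tau(previous))
    print(self_distill_loss(current, previous, label=0))
```

Under this assumed rule, a maximally uncertain previous prediction (uniform logits) yields a distillation weight of zero, so the loss falls back to plain cross-entropy. This mirrors the stated motivation of limiting the influence of inaccurate early-iteration predictions.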

Primary Area: applications to computer vision, audio, language, and other modalities

Submission Number: 8233