Optimizing Knowledge Distillation in Transformers: Enabling Power of Multi-Head Attention without Alignment Barriers

18 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: Knowledge distillation; Multi-head attention; Transformers; Generative models
Abstract: Knowledge distillation has proven effective for compressing transformer architectures by transferring knowledge from a teacher model to a student model. Logits-based distillation methods cannot fully capture the intermediate representations and features within the teacher, so the student may not learn all of the teacher's knowledge. Previous work therefore transfers knowledge through intermediate features or attention maps. However, leveraging multi-head attention maps in transformers for knowledge distillation is challenging due to head misalignment and suboptimal feature alignment, often requiring projectors to align features or special modifications to the model architecture. To address these limitations, we propose the Squeezing-Heads Distillation (SHD) method, which reduces the number of attention maps to any desired number through linear approximation, without requiring additional projectors or parameters. This enables better alignment and knowledge transfer between models with different numbers of heads, improving both flexibility and efficiency. Experimental results demonstrate significant improvements in both language and vision generative models, validating the effectiveness of our method.
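The abstract does not spell out the exact linear-approximation step, so the following is a minimal, hypothetical PyTorch sketch of the general idea: squeezing a teacher's H_t attention maps into H_s maps (the student's head count) via a closed-form least-squares linear combination, then computing a distillation loss, with no extra projectors or learned parameters. The function names (squeeze_heads, attention_distill_loss) and the least-squares fit are illustrative assumptions, not the paper's actual SHD formulation.

```python
import torch
import torch.nn.functional as F


def squeeze_heads(teacher_attn: torch.Tensor, student_attn: torch.Tensor) -> torch.Tensor:
    """Hypothetical sketch: linearly combine the teacher's H_t attention maps
    into H_s maps so they can be compared to the student's maps directly.

    teacher_attn: (B, H_t, L, L) softmax attention maps from the teacher
    student_attn: (B, H_s, L, L) softmax attention maps from the student
    returns:      (B, H_s, L, L) squeezed teacher maps
    """
    B, H_t, L, _ = teacher_attn.shape
    H_s = student_attn.shape[1]

    # Flatten each head's L x L map into a vector: (B, H, L*L).
    T = teacher_attn.reshape(B, H_t, L * L)
    S = student_attn.reshape(B, H_s, L * L)

    # Closed-form least squares per sample: find coefficients C (H_s x H_t)
    # minimizing ||C @ T - S||_F, i.e. the best linear approximation of each
    # student head by the teacher heads. torch.linalg.lstsq solves A X = B,
    # so we solve T^T C^T = S^T. The student maps are detached here so that
    # gradients flow only through the distillation loss, not the fit itself.
    solution = torch.linalg.lstsq(T.transpose(1, 2), S.detach().transpose(1, 2))
    coeffs = solution.solution.transpose(1, 2)  # (B, H_s, H_t)

    squeezed = torch.bmm(coeffs, T).reshape(B, H_s, L, L)
    return squeezed


def attention_distill_loss(teacher_attn, student_attn, eps=1e-8):
    """KL divergence between squeezed teacher maps and student maps."""
    squeezed = squeeze_heads(teacher_attn, student_attn)
    # Renormalize rows in case the linear combination left the probability simplex.
    squeezed = squeezed.clamp_min(eps)
    squeezed = squeezed / squeezed.sum(dim=-1, keepdim=True)
    return F.kl_div(student_attn.clamp_min(eps).log(), squeezed, reduction="batchmean")
```

Under these assumptions the squeezed maps act as distillation targets for the student's attention, which is how a head-count mismatch between teacher and student could be bridged without adding parameters.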
Primary Area: generative models
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 1443