Why does Knowledge Distillation work? Rethink its attention and fidelity mechanism

Published: 01 Jan 2025, Last Modified: 15 May 2025, Expert Syst. Appl. 2025, CC BY-SA 4.0
Abstract:

Highlights
• Good student performance does not imply good student-teacher fidelity.
• Low student-teacher fidelity in KD is caused by the teachers’ attention divergence.
• Low fidelity in KD can hardly be mitigated with logits-matching optimization.
• Diverse attentional patterns in teachers can improve students’ generalization.
• Examining data augmentation’s effects on the learning dynamics of ensemble KD.
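For context, the "logits-matching optimization" mentioned in the highlights refers to the standard soft-target knowledge-distillation objective: a temperature-softened KL-divergence term between teacher and student logits plus a cross-entropy term on the ground-truth labels. The sketch below is a generic PyTorch rendering of that objective, not the paper's own code; the function name `kd_logits_loss` and the default temperature `T` and mixing weight `alpha` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def kd_logits_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Generic logits-matching KD loss: softened KL term + hard-label cross-entropy."""
    # Soft targets: KL divergence between temperature-scaled teacher and student distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)  # T^2 rescaling keeps gradient magnitudes comparable across temperatures
    # Hard targets: ordinary cross-entropy with the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Example usage with random tensors standing in for a batch of 8 samples over 10 classes.
student_logits = torch.randn(8, 10)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
loss = kd_logits_loss(student_logits, teacher_logits, labels)
```

The highlights argue that even when a student trained this way performs well, it may match the teacher's output distribution (fidelity) poorly, and that tightening this logits-matching term alone does little to close that gap.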