Abstract: Highlights
• Good student performance does not imply good student-teacher fidelity.
• Low student-teacher fidelity in KD is caused by the teachers’ attention divergence.
• Low fidelity in KD can hardly be mitigated with logits-matching optimization.
• Diverse attentional patterns in teachers can improve students’ generalization.
• Examining data augmentation’s effects on learning dynamics in ensemble KD.