Keywords: Knowledge Distillation, Dynamic Weighting
Abstract: Knowledge distillation (KD) is a widely used approach for compressing large neural networks into compact student models by combining supervised learning with teacher-guided alignment. While recent studies have attempted to improve KD through adaptive weighting of the supervised and distillation objectives, most existing methods determine weights solely from gradients computed on a single mini-batch. This batch-local perspective neglects the crucial requirement that student updates should generalize to unseen data, often resulting in gradient conflicts, unstable training dynamics, and suboptimal performance. In this work, we introduce a cross-batch dynamic weighting framework for KD that explicitly incorporates generalization signals beyond the current batch. At each iteration, we leverage an auxiliary batch as a proxy for unseen data, compute its supervised gradient as a reference, and solve a lightweight quadratic program to adaptively select weights that align the combined update direction with this reference. To further stabilize optimization, we normalize task gradients and introduce a scaling mechanism that balances their magnitudes while maintaining computational efficiency. Extensive experiments on standard benchmarks demonstrate that our approach consistently outperforms fixed-weight and batch-local adaptive baselines, leading to more stable optimization and superior student performance. These results highlight the importance of cross-batch consistency in KD and establish our method as a principled and effective strategy for dynamic loss balancing.
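To make the per-iteration weight selection concrete, below is a minimal sketch of the kind of quadratic program the abstract describes: the supervised gradient on an auxiliary batch serves as the reference, and two weights on the normalized supervised and distillation gradients are chosen to align the combined update with it. The function name `cross_batch_weights`, the simplex constraint (weights in [0, 1] summing to one), and the unit-norm normalization are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def normalize(g, eps=1e-12):
    # Rescale a flattened gradient vector to unit norm (assumed balancing scheme).
    return g / (np.linalg.norm(g) + eps)

def cross_batch_weights(g_sup, g_kd, g_ref):
    """Solve, in closed form, the 2-variable quadratic program
        min_{w in [0, 1]}  || w * g_sup + (1 - w) * g_kd - g_ref ||^2
    where g_ref is the supervised gradient from the auxiliary batch.
    Returns (w_sup, w_kd). The simplex constraint is an assumption of this sketch.
    """
    a, b, r = normalize(g_sup), normalize(g_kd), normalize(g_ref)
    d = a - b
    denom = float(d @ d)
    if denom < 1e-12:          # supervised and distillation gradients nearly identical
        return 0.5, 0.5
    w = float(d @ (r - b)) / denom   # unconstrained minimizer of the 1-D quadratic
    w = min(max(w, 0.0), 1.0)        # project onto [0, 1]
    return w, 1.0 - w

# Toy usage with random flattened gradient vectors standing in for real ones.
rng = np.random.default_rng(0)
g_sup, g_kd, g_ref = (rng.standard_normal(10) for _ in range(3))
w_sup, w_kd = cross_batch_weights(g_sup, g_kd, g_ref)
combined_update = w_sup * g_sup + w_kd * g_kd   # direction handed to the optimizer
print(w_sup, w_kd)
```

Because only two weights are optimized, the quadratic program reduces to a one-dimensional quadratic with a closed-form solution, which is consistent with the "lightweight" and computationally efficient framing in the abstract.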
Primary Area: optimization
Submission Number: 20287