Keywords: Knowledge Distillation, Dynamic Weighting
Abstract: Knowledge distillation (KD) is a widely used approach for compressing large neural networks into compact student models by combining supervised learning with teacher-guided alignment. While recent studies have attempted to improve KD through adaptive weighting of the supervised and distillation objectives, most existing methods determine weights solely from gradients computed on a single mini-batch. This batch-local perspective neglects the crucial requirement that student updates should generalize to unseen data, often resulting in gradient conflicts, unstable training dynamics, and suboptimal performance. In this work, we introduce a cross-batch dynamic weighting framework for KD that explicitly incorporates generalization signals beyond the current batch. At each iteration, we leverage an auxiliary batch as a proxy for unseen data, compute its supervised gradient as a reference, and solve a lightweight quadratic program to adaptively select weights that align the combined update direction with this reference. To further stabilize optimization, we normalize task gradients and introduce a scaling mechanism that balances their magnitudes while maintaining computational efficiency. Extensive experiments on standard benchmarks demonstrate that our approach consistently outperforms fixed-weight and batch-local adaptive baselines, leading to more stable optimization and superior student performance. These results highlight the importance of cross-batch consistency in KD and establish our method as a principled and effective strategy for dynamic loss balancing.
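To make the per-iteration weight selection concrete, below is a minimal sketch of the kind of quadratic program the abstract describes: the supervised gradient on an auxiliary batch serves as the reference, and two weights on the normalized supervised and distillation gradients are chosen to align the combined update with it. The function name `cross_batch_weights`, the simplex constraint (weights in [0, 1] summing to one), and the unit-norm normalization are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def normalize(g, eps=1e-12):
    # Rescale a flattened gradient vector to unit norm (assumed balancing scheme).
    return g / (np.linalg.norm(g) + eps)

def cross_batch_weights(g_sup, g_kd, g_ref):
    """Solve, in closed form, the 2-variable quadratic program
        min_{w in [0, 1]}  || w * g_sup + (1 - w) * g_kd - g_ref ||^2
    where g_ref is the supervised gradient from the auxiliary batch.
    Returns (w_sup, w_kd). The simplex constraint is an assumption of this sketch.
    """
    a, b, r = normalize(g_sup), normalize(g_kd), normalize(g_ref)
    d = a - b
    denom = float(d @ d)
    if denom < 1e-12:          # supervised and distillation gradients nearly identical
        return 0.5, 0.5
    w = float(d @ (r - b)) / denom   # unconstrained minimizer of the 1-D quadratic
    w = min(max(w, 0.0), 1.0)        # project onto [0, 1]
    return w, 1.0 - w

# Toy usage with random flattened gradient vectors standing in for real ones.
rng = np.random.default_rng(0)
g_sup, g_kd, g_ref = (rng.standard_normal(10) for _ in range(3))
w_sup, w_kd = cross_batch_weights(g_sup, g_kd, g_ref)
combined_update = w_sup * g_sup + w_kd * g_kd   # direction handed to the optimizer
print(w_sup, w_kd)
```

Because only two weights are optimized, the quadratic program reduces to a one-dimensional quadratic with a closed-form solution, which is consistent with the "lightweight" and computationally efficient framing in the abstract.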
Primary Area: optimization
Submission Number: 20287