Bayesian Self-Distillation for Image Classification

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Self-distillation, Knowledge Distillation, Bayesian Methods, Calibration, Robustness, Image Classification
Abstract: Supervised training of deep neural networks for classification typically relies on hard targets, which promote overconfidence and often limit calibration, generalization, and robustness. Self-distillation methods aim to mitigate this by exploiting the inter-class and sample-specific information present in the model's own predictions, but they often remain dependent on hard targets, which limits their effectiveness. To address this, we propose Bayesian Self-Distillation (BSD), a principled method that constructs sample-specific target distributions via Bayesian inference over the model's own predictions. Unlike existing approaches, BSD does not rely on hard targets after initialization. BSD consistently yields higher test accuracy (e.g., +1.4\% for ResNet-50 on CIFAR-100) and substantially lower Expected Calibration Error (ECE) (e.g., a 40\% reduction for ResNet-50 on CIFAR-100) than existing architecture-preserving self-distillation methods across a range of deep architectures and datasets. Additional benefits include improved robustness to data corruptions, perturbations, and label noise. When combined with a contrastive loss, BSD achieves state-of-the-art robustness under label noise among single-stage, single-network methods. Code will be made available upon acceptance.
Supplementary Material: zip
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Submission Number: 14009
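
To make the idea in the abstract concrete, here is a minimal, hypothetical sketch of self-distillation with sample-specific soft targets obtained from a conjugate (Dirichlet-style) update on the model's own predictions. The abstract does not specify BSD's actual construction, so the update rule, function names, and hyperparameters (`prior_strength`, `evidence_scale`) below are illustrative assumptions, not the authors' method.

```python
# Hypothetical sketch only: the abstract does not give BSD's exact update rule.
# It illustrates replacing hard targets with sample-specific soft targets derived
# from the model's own predictions via a simple Dirichlet pseudo-count update.
import torch
import torch.nn.functional as F


def soft_targets_from_predictions(logits, labels, num_classes,
                                  prior_strength=1.0, evidence_scale=10.0):
    """Combine a weak Dirichlet prior centered on the label with pseudo-counts
    from the model's own predictions; return the posterior mean as the target."""
    with torch.no_grad():
        probs = F.softmax(logits, dim=-1)                # model's current beliefs
        prior = F.one_hot(labels, num_classes).float() * prior_strength + 1.0
        posterior_counts = prior + evidence_scale * probs  # conjugate pseudo-count update
        targets = posterior_counts / posterior_counts.sum(dim=-1, keepdim=True)
    return targets


def self_distillation_step(model, x, y, num_classes, optimizer):
    """One training step: cross-entropy against soft, sample-specific targets
    instead of one-hot labels."""
    logits = model(x)
    targets = soft_targets_from_predictions(logits, y, num_classes)
    loss = torch.sum(-targets * F.log_softmax(logits, dim=-1), dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

As a usage note, such a step would typically replace the standard cross-entropy update inside an ordinary training loop; the label enters only through the prior, so its influence can be annealed after initialization, in the spirit of the abstract's claim that BSD does not rely on hard targets beyond that point.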