CAFÉ: Coverage-Aware Self-Distillation to Mitigate Forgetting in Deep Networks

16 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Overfitting, Double Descent, Knowledge Distillation, Self Distillation, Label Noise, Checkpoint Ensembles
TL;DR: CAFÉ is a coverage-aware self-distillation method that tracks validation performance and dynamically reuses past checkpoints to prevent local overfitting, yielding consistently higher accuracy and robustness under both clean and noisy labels.
Abstract: Deep neural networks rarely exhibit global overfitting in the classical sense, yet they often suffer from a less visible problem: forgetting of previously learned patterns. This phenomenon, termed local overfitting, degrades performance in specific regions of the input space even as overall accuracy improves. To address this problem, we propose CAFÉ (Coverage-Aware Forgetting Elimination), an online, validation-aware, single-model method that mitigates forgetting during training by exploiting self-distillation. CAFÉ identifies and prioritizes checkpoints that uniquely recover forgotten validation samples, dynamically weighting their contributions to form evolving soft labels at each epoch of training. Our experiments show that CAFÉ consistently outperforms both standard training and recent state-of-the-art self-distillation methods under clean and noisy labels, across CIFAR-100 and TinyImageNet, with and without data augmentation. Beyond raw accuracy gains, our results provide quantitative evidence of the substantial impact of forgetting on deep learning performance, and demonstrate that targeted mitigation yields measurable robustness.
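The coverage-aware weighting described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the uniform fallback, and the fixed blending coefficient `alpha` are assumptions for exposition; the paper's actual weighting and label-mixing rules may differ.

```python
import numpy as np

def coverage_weights(ckpt_preds, current_pred, y_val):
    """Hypothetical sketch of coverage-aware checkpoint weighting.

    ckpt_preds:   (K, N, C) softmax outputs of K past checkpoints on N val samples
    current_pred: (N, C)    softmax output of the current model
    y_val:        (N,)      integer validation labels
    Weights each checkpoint by how many currently-forgotten samples it recovers.
    """
    forgotten = current_pred.argmax(-1) != y_val       # samples the current model now misses
    recovers = ckpt_preds.argmax(-1) == y_val          # (K, N) per-checkpoint correctness
    counts = (recovers & forgotten).sum(axis=1).astype(float)
    if counts.sum() == 0:                              # nothing forgotten: uniform fallback
        counts = np.ones(len(ckpt_preds))
    return counts / counts.sum()

def soft_labels(ckpt_preds, weights, hard_onehot, alpha=0.5):
    """Blend the weighted checkpoint ensemble with ground-truth one-hot labels."""
    teacher = np.tensordot(weights, ckpt_preds, axes=1)  # (N, C) weighted ensemble
    return alpha * hard_onehot + (1 - alpha) * teacher
```

In this sketch, checkpoints that uniquely recover forgotten samples receive larger weights, so their predictions dominate the evolving soft labels used as the distillation target in the next epoch.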
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 7494