WHY TEACHER–STUDENT SELF-SUPERVISED LEARNING WORKS: A MUTUAL INFORMATION PERSPECTIVE

15 Sept 2025 (modified: 11 Feb 2026), Submitted to ICLR 2026, CC BY 4.0
Keywords: Representation Learning, Non-Contrastive Learning, Mutual Information, Teacher-Student framework
TL;DR: We show that teacher–student SSL implicitly maximizes mutual information, derive convergence results, and propose an MI-based regularizer that boosts BYOL and SimSiam performance on natural and medical imaging benchmarks.
Abstract: We study teacher–student (TS) self-supervised learning methods equipped with a prediction head (e.g., BYOL, SimSiam), which learn meaningful representations without relying on negative samples. Building on the InfoMax perspective that unifies many multi-view Self-Supervised Learning (SSL) families, we show that TS-SSL implicitly maximizes a lower bound on the mutual information $I(Z_\theta; X)$ between the inputs $X$ and the teacher representations $Z_\theta$. Concretely, we prove that, assuming an optimal predictor, the BYOL and SimSiam loss is an approximation of $H(Z_\theta \mid Z_\phi, X)$. Building on this result, we prove that, under a mild assumption verified empirically on six datasets, the alternating optimization (student prediction with stop-gradient, followed by teacher updates) implicitly optimizes $\theta$ to maximize $I(Z_\theta; (X, Z_\phi))$, a lower bound on $I(Z_\theta; X)$. We then derive the incremental convergence dynamics of the teacher representation’s entropy and alignment during training. Finally, motivated by these theoretical insights, we introduce a simple mutual-information-based regularizer on the student latent space that enforces monotonic growth of $I(Z_\theta; X)$ and yields consistent downstream improvements on both natural-image and medical-imaging benchmarks.
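To make the alternating optimization and the regularizer concrete, below is a minimal PyTorch-style sketch of one TS-SSL training step. It is an illustration, not the paper's implementation: the names (`student`, `teacher`, `predictor`, `lambda_mi`, `logdet_entropy`) are hypothetical, and since the abstract does not specify the regularizer's exact form, a Gaussian log-determinant entropy surrogate on the student latents is used purely as a stand-in for the proposed MI-based term.

```python
# Minimal sketch of a BYOL/SimSiam-style teacher-student step with an
# illustrative mutual-information regularizer. All symbols are assumptions.
import torch
import torch.nn.functional as F

def logdet_entropy(z, eps=1e-4):
    """Entropy surrogate: log-determinant of the batch covariance (Gaussian assumption)."""
    z = z - z.mean(dim=0, keepdim=True)
    cov = (z.T @ z) / (z.shape[0] - 1)
    cov = cov + eps * torch.eye(cov.shape[0], device=z.device)
    return 0.5 * torch.logdet(cov)

def ts_ssl_step(student, teacher, predictor, optimizer, view1, view2, lambda_mi=0.1):
    """One alternating-optimization step: student prediction against a
    stop-gradient teacher target, plus an entropy term on the student latents."""
    z_student = student(view1)          # Z_phi: student representation
    with torch.no_grad():
        z_teacher = teacher(view2)      # Z_theta: teacher target (stop-gradient)
    p = predictor(z_student)            # prediction head on the student branch

    # Alignment loss: negative cosine similarity, as in BYOL / SimSiam.
    align = -F.cosine_similarity(p, z_teacher, dim=-1).mean()

    # Illustrative MI-based regularizer: discourage entropy collapse of the
    # student latent space (a proxy for keeping I(Z; X) from shrinking).
    reg = -logdet_entropy(z_student)

    loss = align + lambda_mi * reg
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Teacher update: exponential moving average of student weights (BYOL-style).
    with torch.no_grad():
        for p_t, p_s in zip(teacher.parameters(), student.parameters()):
            p_t.mul_(0.99).add_(p_s, alpha=0.01)
    return loss.item()
```

In a SimSiam-style variant the teacher would simply share weights with the student (no EMA update), while the stop-gradient on the teacher branch is kept in both cases.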
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 5849