PAC-Bayesian Analysis of the Surrogate Relation between Joint Embedding and Supervised Downstream Losses

Published: 18 Dec 2025, Last Modified: 21 Feb 2026, ALT 2026, CC BY 4.0
Keywords: Self-supervised learning, representation learning, generalization, PAC-Bayes bounds
TL;DR: A generalization analysis of performance transfer in self-supervised learning pipelines, specifically considering state-of-the-art joint embedding loss functions.
Abstract: In recent years, self-supervised representation learning (SSL) has become an important learning paradigm and a crucial component of foundation models. SSL-based training pipelines are typically formalized as a sequence of two tasks: a pretext task, which learns representations from large amounts of augmented unlabeled data, and a downstream task, in which a simple model is fit on the learned representations using a small amount of labeled data. The strong empirical performance of SSL-based pipelines with prominent joint embedding loss functions is not yet well explained in theory, for two main reasons: a lack of non-vacuous generalization bounds for the models learned in the pretext task, and a lack of practically computable transfer bounds that describe how generalization bounds derived for the pretext task transfer to the downstream task. In this work, we first derive non-vacuous PAC-Bayesian generalization bounds for models optimized in the pretext task with prominent joint embedding SSL loss functions (VICReg, Barlow Twins, and the Spectral Contrastive loss), accounting for their non-i.i.d. nature. Next, we provide the first practically computable transfer bounds for the considered loss functions by formally proving a surrogate relation that upper bounds the downstream squared L2 loss by the SSL pretext loss and a measure of the influence of the chosen augmentations that is more accurate than those in previous work. In addition, our theoretical analysis identifies effective hyperparameter choices, thereby reducing the need for extensive hyperparameter tuning and offering principled guidance for model selection. We empirically validate our theoretical findings on the CIFAR-10 and MNIST datasets.
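To make the pretext objective concrete, the following is a minimal NumPy sketch of one of the joint embedding losses named in the abstract, VICReg, with its invariance, variance, and covariance terms. The coefficient values and the hinge target are illustrative defaults, not taken from this paper; the paper's actual hyperparameter recommendations come from its theoretical analysis.

```python
import numpy as np

def vicreg_loss(z1, z2, lam=25.0, mu=25.0, nu=1.0, gamma=1.0, eps=1e-4):
    """Sketch of the VICReg joint embedding loss.

    z1, z2: (n, d) embeddings of two augmented views of the same batch.
    lam, mu, nu: illustrative weights for the invariance, variance,
    and covariance terms; gamma is the target standard deviation.
    """
    n, d = z1.shape

    # Invariance: mean squared difference between the two views' embeddings.
    inv = np.mean((z1 - z2) ** 2)

    def variance_term(z):
        # Hinge penalty on dimensions whose std falls below gamma.
        std = np.sqrt(z.var(axis=0) + eps)
        return np.mean(np.maximum(0.0, gamma - std))

    def covariance_term(z):
        # Penalize off-diagonal entries of the embedding covariance matrix.
        zc = z - z.mean(axis=0)
        cov = (zc.T @ zc) / (n - 1)
        off_diag = cov - np.diag(np.diag(cov))
        return np.sum(off_diag ** 2) / d

    return (lam * inv
            + mu * (variance_term(z1) + variance_term(z2))
            + nu * (covariance_term(z1) + covariance_term(z2)))

# Usage: the loss is small when both views map to identical, decorrelated,
# high-variance embeddings, and grows when the two views disagree.
rng = np.random.default_rng(0)
z = rng.standard_normal((128, 8))
loss_same = vicreg_loss(z, z)
loss_diff = vicreg_loss(z, rng.standard_normal((128, 8)))
```

The surrogate relation proved in the paper upper bounds the downstream squared L2 loss by a pretext loss of this kind plus an augmentation-dependent term, so driving such a loss down during pretraining controls downstream performance.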
Submission Number: 50