Abstract: Rapid advances in self-supervised representation learning have highlighted its potential to leverage unlabeled data for learning rich visual representations. However, existing techniques, particularly those employing different augmentations of the same image, often rely on a limited set of simple transformations that cannot fully capture variations in the real world. This constrains the diversity and quality of samples, leading to sub-optimal representations. In this paper, we introduce a framework that enriches the self-supervised learning (SSL) paradigm by utilizing generative models to produce semantically consistent image augmentations. By directly conditioning generative models on a source image, our method enables the generation of diverse augmentations while maintaining the semantics of the source image, thus offering a richer set of data for SSL. Our extensive experimental results on various joint-embedding SSL techniques demonstrate that our framework significantly enhances the quality of learned visual representations, improving Top-1 accuracy on downstream tasks by up to 10%. This work shows that incorporating generative models into the joint-embedding SSL workflow opens new avenues for exploring the potential of synthetic data, paving the way for more robust and versatile representation learning techniques.
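
To make the idea concrete, the minimal sketch below pairs each source image with a view produced by an image-conditioned generative model and trains a joint-embedding encoder with a standard contrastive (SimCLR-style NT-Xent) objective. All names here (`generative_augment`, `Encoder`, `nt_xent`) are illustrative placeholders rather than the paper's implementation; in practice the augmenter would be a pretrained image-conditioned generative model (e.g., a diffusion model producing image variations), not the toy noise function used to keep the sketch self-contained.

```python
import torch
import torch.nn.functional as F
from torch import nn


def generative_augment(images: torch.Tensor) -> torch.Tensor:
    """Placeholder for an image-conditioned generative model.

    In the described framework this would sample a semantically consistent
    variation of each source image; here we only add mild noise so the
    sketch runs end to end.
    """
    return (images + 0.05 * torch.randn_like(images)).clamp(0.0, 1.0)


class Encoder(nn.Module):
    """Tiny stand-in for a backbone plus projection head."""

    def __init__(self, dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(x), dim=-1)


def nt_xent(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """SimCLR-style contrastive loss over two batches of paired embeddings."""
    z = torch.cat([z1, z2], dim=0)            # (2N, d), unit-normalized
    sim = z @ z.t() / tau                     # cosine similarities as logits
    sim.fill_diagonal_(float("-inf"))         # exclude self-similarity
    n = z1.size(0)
    # Positive for sample i is its generated counterpart i+n (and vice versa).
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)


# Usage: each source image is paired with a generated, semantically consistent view.
encoder = Encoder()
images = torch.rand(8, 3, 64, 64)             # toy batch of source images
view_a = images                               # or a conventional augmentation
view_b = generative_augment(images)           # generative augmentation
loss = nt_xent(encoder(view_a), encoder(view_b))
loss.backward()
```

The same pairing strategy applies to other joint-embedding objectives; the contrastive loss above is used only because it is the most compact to write out.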