Diversity Is All You Need for Contrastive Learning: Spectral Bounds on Gradient Magnitudes

Published: 18 Sept 2025, Last Modified: 29 Oct 2025 · NeurIPS 2025 poster · CC BY 4.0
Keywords: diversity, contrastive learning, batch selection
Abstract: Contrastive learning thrives—or fails—based on how we construct \emph{positive} and \emph{negative} pairs. In the absence of explicit labels, models must infer semantic structure from these proxy signals. Early work on Siamese networks \citep{chopra2005learning,hadsell2006dimensionality} already showed that pair construction directly shapes learned representations. In modern contrastive frameworks, poor pair selection remains a primary failure mode: it either causes collapse, where all embeddings converge to a point, or wastes the representational capacity of the space \citep{chen2020simple,tian2020makes,khosla2020supervised}. Contemporary methods typically generate positives via semantic-preserving augmentations (crop, jitter, view transform), while negatives are drawn from other elements in the mini-batch under the assumption that different images are semantically dissimilar. But this assumption breaks down in fine-grained, low-diversity, or high-resolution settings \citep{kalantidis2020hard,robinson2020contrastive,chuang2020debiased}, motivating techniques such as hard-negative mining and debiased losses \citep{bachman2019learning,tian2020makes}.

\paragraph{Beyond pairs: batch-level diversity.} While most prior work focuses on \emph{which} individual negatives to select, we study the geometry of the entire batch. Our central observation is this: the overall \emph{diversity} of the batch embedding space strongly governs both training dynamics and representational quality. If diversity is too low, the model sees nearly identical negatives and gradients vanish—leading to collapse. If diversity is too high, negatives become almost orthogonal, but the resulting gradients shrink in magnitude, and learning slows. Optimal training thus occurs within a \emph{moderate diversity window}: high enough to avoid collapse, low enough to preserve update strength.
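Below is a minimal, self-contained sketch of the batch-level quantities the abstract reasons about: a spectral diversity proxy for a batch of embeddings and the magnitude of the resulting contrastive gradient. The specific choices here (InfoNCE with in-batch negatives, effective rank of the embedding spectrum as the diversity measure, a temperature of 0.1) are illustrative assumptions for exposition, not the paper's definitions or bounds.

```python
import torch
import torch.nn.functional as F


def effective_rank(embeddings: torch.Tensor) -> torch.Tensor:
    """Spectral diversity proxy: exponentiated entropy of the normalized
    singular values of the L2-normalized batch embedding matrix.
    Close to 1 when all embeddings point the same way (low diversity);
    close to min(B, d) when the batch spans many directions (high diversity)."""
    z = F.normalize(embeddings, dim=1)
    s = torch.linalg.svdvals(z)
    p = s / s.sum()
    return torch.exp(-(p * torch.log(p + 1e-12)).sum())


def infonce_grad_norm(anchors: torch.Tensor,
                      positives: torch.Tensor,
                      temperature: float = 0.1) -> torch.Tensor:
    """Mean per-sample gradient norm of the InfoNCE loss w.r.t. the anchor
    embeddings, with the rest of the batch acting as negatives."""
    z = F.normalize(anchors, dim=1).detach().requires_grad_(True)
    p = F.normalize(positives, dim=1).detach()
    logits = z @ p.t() / temperature            # (B, B); diagonal = positive pairs
    labels = torch.arange(z.size(0), device=z.device)
    loss = F.cross_entropy(logits, labels)
    (grad,) = torch.autograd.grad(loss, z)
    return grad.norm(dim=1).mean()


# Toy probe of the "moderate diversity window": interpolate batches between a
# single shared direction (alpha=0, collapse-like) and pure noise (alpha=1,
# near-orthogonal negatives) and watch how diversity and gradient norm move.
if __name__ == "__main__":
    torch.manual_seed(0)
    B, d = 256, 128
    shared = torch.randn(1, d)
    for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
        base = (1 - alpha) * shared + alpha * torch.randn(B, d)
        anchors = base + 0.05 * torch.randn(B, d)    # two augmented "views"
        positives = base + 0.05 * torch.randn(B, d)
        print(f"alpha={alpha:.2f}  "
              f"eff_rank={effective_rank(anchors):.1f}  "
              f"grad_norm={infonce_grad_norm(anchors, positives):.4f}")
```

The probe only makes the abstract's qualitative claim concrete: when diversity is very low the negatives are nearly identical to the positive and the gradient nearly cancels, while at very high diversity the negatives contribute little signal and the update shrinks. How the paper formalizes this window via spectral bounds is not reproduced here.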
Primary Area: General machine learning (supervised, unsupervised, online, active, etc.)
Submission Number: 23493