Uncover the balanced geometry in long-tailed contrastive language-image pretraining

Published: 01 Jan 2025, Last Modified: 03 Sept 2025 · Mach. Learn. 2025 · CC BY-SA 4.0
Abstract: While Contrastive Language-Image Pretraining (CLIP) has become the de facto standard for vision-language pretraining, exploration of the inherent long-tailed distribution of the pretraining data remains limited. From a neural collapse perspective, we show in principle that vanilla CLIP training can be vulnerable to long-tailed distributions, which may distort the learned representations, reducing inter-class separation and discriminative ability. To combat this issue, we propose an improved method, termed Geometry-Balanced CLIP (GeoCLIP), which automatically constructs pseudo clusters and aligns them with a predefined equiangular geometric structure, thereby enjoying the theoretical merit of better maintaining uniformity at the semantic level. Furthermore, we enhance GeoCLIP's generality for real-world complex distributions by incorporating harmonized clusters that integrate both empirically observed data structures and the theoretically optimal geometry. Extensive experiments across various benchmarks demonstrate the consistent superiority of GeoCLIP in learning robust and transferable representations under long-tailed distributions. The source code will be publicly available.
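The abstract does not spell out GeoCLIP's objective, but the neural-collapse geometry it references is well defined: a simplex equiangular tight frame (ETF) whose anchors are maximally and equally separated. Below is a minimal, illustrative PyTorch sketch of that idea: building an ETF and pulling features of each pseudo cluster toward its assigned anchor. The function names `simplex_etf` and `geometry_alignment_loss`, and the temperature value, are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def simplex_etf(num_clusters: int, dim: int) -> torch.Tensor:
    """Build a K x d simplex equiangular tight frame: K unit vectors whose
    pairwise cosine similarity is -1/(K-1), the neural-collapse geometry."""
    assert dim >= num_clusters - 1, "embedding dim must be at least K-1"
    # Random orthonormal basis (d x K) via reduced QR decomposition.
    u, _ = torch.linalg.qr(torch.randn(dim, num_clusters))
    # Centering matrix I - (1/K) 11^T makes the columns equiangular.
    center = torch.eye(num_clusters) - torch.full((num_clusters, num_clusters),
                                                  1.0 / num_clusters)
    etf = (num_clusters / (num_clusters - 1)) ** 0.5 * (u @ center)  # d x K
    return F.normalize(etf.T, dim=-1)  # K x d, unit-norm anchor per cluster

def geometry_alignment_loss(features: torch.Tensor,
                            cluster_ids: torch.Tensor,
                            etf: torch.Tensor,
                            temperature: float = 0.07) -> torch.Tensor:
    """Cross-entropy over cosine similarities between L2-normalized features
    (B x d) and the fixed ETF anchors (K x d), targeting each sample's
    pseudo-cluster id (B,). Aligning clusters to fixed equiangular anchors
    keeps the class means uniformly spread regardless of cluster frequency."""
    logits = F.normalize(features, dim=-1) @ etf.T / temperature  # B x K
    return F.cross_entropy(logits, cluster_ids)
```

As a usage sketch, one could add `geometry_alignment_loss(image_features, pseudo_labels, simplex_etf(K, d))` as a regularizer alongside the standard CLIP contrastive loss; how GeoCLIP actually forms the pseudo clusters and harmonizes them with the empirical data structure is described in the paper, not here.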