Keywords: spurious correlation, vision-language models, density
Abstract: Vision-language models (VLMs) such as CLIP achieve powerful zero-shot classification. However, their predictions remain highly sensitive to spurious correlations, where common background or contextual cues dominate over semantic content. Earlier solutions typically rely on fine-tuning, but this undermines the advantages of pre-trained models. Others depend on prompt engineering, which is prone to hallucination. In addition, most approaches are limited to a single modality, increasing the risk of misalignment between text and images. In this work, we propose Density-Aware Translation (DAT), which refines image-text similarity scores using a local geometric density term derived from group reference sets. Our approach is motivated by the observation that CLIP embeddings exhibit a modality gap and lie on an anisotropic shell in feature space: common patterns cluster near the mean, while rare patterns are pushed outward. This geometry creates uneven alignment, in which spurious correlations are amplified while semantically meaningful but rare cues are marginalised. To address this, we employ a relative measure that rescales similarities based on embedding density, suppressing overconfident scores in diffuse regions while preserving dense, semantically consistent matches. Experimental results on benchmark datasets demonstrate consistent improvements in worst-group and average accuracy, highlighting density-aware translation as a simple and effective calibration mechanism for reliable zero-shot classification with multimodal models.
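To make the abstract's high-level description concrete, the following is a minimal sketch (not the authors' implementation) of how zero-shot image-text similarities could be rescaled by a local density estimate derived from a reference set. The density estimator (mean cosine similarity to the k nearest reference embeddings), the strength parameter `alpha`, and all function names are assumptions, since the abstract does not specify the actual computation.

```python
# Hedged sketch: density-aware rescaling of CLIP-style similarities.
# The density formula, hyperparameters, and names below are illustrative
# assumptions, not the method as defined in the paper.
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Project embeddings onto the unit sphere (standard for CLIP features)."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def local_density(embeddings, reference_set, k=10):
    """One possible 'local geometric density term': mean cosine similarity
    of each embedding to its k nearest neighbours in a group reference set."""
    emb = l2_normalize(embeddings)
    ref = l2_normalize(reference_set)
    sims = emb @ ref.T                      # (n, m) cosine similarities
    topk = np.sort(sims, axis=1)[:, -k:]    # k most similar reference points
    density = topk.mean(axis=1)             # (n,) density scores
    return np.clip(density, 1e-8, None)     # keep scores positive for rescaling

def density_aware_scores(image_emb, text_emb, reference_set, k=10, alpha=1.0):
    """Rescale raw image-text similarities by each image's local density:
    scores in diffuse (low-density) regions are attenuated, while dense,
    semantically consistent regions are preserved. 'alpha' is a hypothetical
    strength parameter introduced here for illustration."""
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    raw = img @ txt.T                                        # (n_images, n_classes)
    density = local_density(image_emb, reference_set, k=k)   # (n_images,)
    return raw * (density[:, None] ** alpha)

# Toy usage with random features standing in for CLIP embeddings.
rng = np.random.default_rng(0)
image_emb = rng.normal(size=(4, 512))
text_emb = rng.normal(size=(2, 512))        # e.g. one prompt embedding per class
reference_set = rng.normal(size=(100, 512)) # group reference embeddings
scores = density_aware_scores(image_emb, text_emb, reference_set)
predictions = scores.argmax(axis=1)
```

In this reading, the density term acts as a per-image calibration factor on top of the usual cosine-similarity classifier, so low-density (diffuse) matches contribute less to the final prediction.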
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 17746