Foreground or Background? Visual Interpretability and Robustness Analysis of CLIP

17 Sept 2025 (modified: 14 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: vision-language models, interpretability, spurious correlations
TL;DR: We introduce a training-free interpretability method (CCI) and a benchmark dataset to quantify whether vision–language models rely on object foreground or spurious background cues
Abstract: Contrastive vision–language models (VLMs) such as CLIP achieve strong zero-shot recognition yet remain vulnerable to spurious correlations, particularly over-reliance on background cues. We introduce Cluster-based Concept Importance (CCI), a novel interpretability method that attributes image–text similarity by grouping patches into coherent clusters, masking them, and evaluating the relative change in model predictions. CCI sets a new state of the art on faithfulness benchmarks, surpassing prior methods by large margins; for example, it yields more than a twofold improvement on the deletion-AUC metric for MS COCO retrieval. We further show that CCI, when combined with GroundedSAM, automatically categorizes predictions as foreground- or background-driven, providing a crucial diagnostic capability. Existing benchmarks such as CounterAnimals, however, rely solely on accuracy and implicitly attribute all performance degradation to background correlations. Our analysis shows this assumption to be incomplete, since many errors arise from viewpoint variation, scale shifts, and fine-grained object confusions. To disentangle these effects, we introduce COVAR, a benchmark that systematically varies object foregrounds and backgrounds. Leveraging CCI with COVAR, we conduct a comprehensive evaluation of eighteen CLIP variants, providing both methodological advances and empirical evidence that chart a path toward more robust vision–language models.
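The abstract's description of CCI (cluster patches, mask each cluster, measure the relative change in image–text similarity) suggests the following minimal sketch. It assumes a Hugging Face CLIP checkpoint, k-means clustering of patch tokens, and mean-pixel masking; the function name, cluster count, and masking strategy are illustrative assumptions, not the authors' exact procedure.

```python
# Hypothetical sketch of cluster-based masking attribution in the spirit of CCI.
import torch
from PIL import Image
from sklearn.cluster import KMeans
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def cci_attribution(image: Image.Image, caption: str, n_clusters: int = 5):
    inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
    pixel_values = inputs["pixel_values"]  # [1, 3, 224, 224]

    with torch.no_grad():
        # Patch tokens from the vision tower (drop the CLS token).
        patch_tokens = model.vision_model(pixel_values=pixel_values).last_hidden_state[0, 1:]
        text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                           attention_mask=inputs["attention_mask"])
        base_img_emb = model.get_image_features(pixel_values=pixel_values)

    cos = torch.nn.CosineSimilarity(dim=-1)
    base_sim = cos(base_img_emb, text_emb).item()

    # Group patches into coherent clusters in feature space (k-means is an assumption).
    labels = KMeans(n_clusters=n_clusters, n_init="auto").fit_predict(patch_tokens.numpy())

    patch = model.config.vision_config.patch_size   # 32 for ViT-B/32
    grid = pixel_values.shape[-1] // patch           # 7x7 patch grid
    scores = []
    for c in range(n_clusters):
        masked = pixel_values.clone()
        for idx in (labels == c).nonzero()[0]:
            r, col = divmod(int(idx), grid)
            # Mask this cluster's patches with the image mean (one of several plausible choices).
            masked[:, :, r*patch:(r+1)*patch, col*patch:(col+1)*patch] = pixel_values.mean()
        with torch.no_grad():
            masked_emb = model.get_image_features(pixel_values=masked)
        # Relative drop in image–text similarity attributed to this cluster.
        scores.append((base_sim - cos(masked_emb, text_emb).item()) / abs(base_sim))
    return labels, scores
```

Under this reading, a cluster whose removal sharply reduces similarity is the one driving the prediction; intersecting such clusters with GroundedSAM foreground masks would then label a prediction as foreground- or background-driven.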
Primary Area: interpretability and explainable AI
Submission Number: 9598