Debiasing CLIP: Interpreting and Correcting Bias in Attention Heads

Wei Jie Yeo; Rui Mao; Moloud Abdar; Ranjan Satapathy; Erik Cambria

14 Sept 2025 (modified: 11 Feb 2026). Submitted to ICLR 2026. License: CC BY 4.0

Keywords: debiasing, clip, interpretability

TL;DR: This work proposes an interpretability-inspired technique that improves CLIP models on image classification tasks with spurious correlations.

Abstract: Multimodal models like CLIP have gained significant attention due to their remarkable zero-shot performance across various tasks. However, studies have revealed that CLIP can inadvertently learn spurious associations between target variables and confounding factors. To address this, we introduce Locate-Then-Correct (LTC), a contrastive framework that identifies spurious attention heads in Vision Transformers via mechanistic insights and mitigates them through targeted ablation. Furthermore, LTC identifies salient, task-relevant attention heads, enabling the integration of discriminative features through orthogonal projection to improve classification performance. We evaluate LTC on benchmarks with inherent background and gender biases, achieving a gain of more than 50% in worst-group accuracy compared to non-training post-hoc baselines. Additionally, we visualize the representations of selected heads and find that the resulting interpretation corroborates our contrastive mechanism for identifying both spurious and salient attention heads.
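The two operations named in the abstract, head ablation and orthogonal projection, can be illustrated with a toy sketch. This is not the authors' implementation. It assumes the image embedding can be written as a sum of per-head contributions, uses zero-ablation (the abstract does not say which form of ablation LTC uses), and shows projection as removing the component along a given direction; how LTC uses projection to integrate salient features is not specified here. The names `ablate_heads`, `project_out`, `head_contribs`, and `spurious_dir` are hypothetical.

```python
import math


def ablate_heads(head_contribs, spurious_heads):
    """Sum per-head contributions, skipping (zero-ablating) the given heads.

    head_contribs: list of equal-length vectors, one per attention head.
    spurious_heads: set of head indices to drop (assumed already identified).
    """
    dim = len(head_contribs[0])
    total = [0.0] * dim
    for idx, contrib in enumerate(head_contribs):
        if idx in spurious_heads:
            continue  # zero-ablation: this head contributes nothing
        for j, value in enumerate(contrib):
            total[j] += value
    return total


def project_out(x, direction):
    """Return x with its component along `direction` removed."""
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    coeff = sum(a * b for a, b in zip(x, unit))
    return [a - coeff * u for a, u in zip(x, unit)]


# Toy data (hypothetical): three heads writing into a 3-dimensional embedding.
head_contribs = [
    [1.0, 0.0, 2.0],
    [0.0, 1.0, 1.0],
    [3.0, 3.0, 0.0],  # pretend this head mostly encodes a spurious feature
]
spurious_dir = [1.0, 1.0, 0.0]  # assumed direction of the spurious feature

embedding = ablate_heads(head_contribs, {2})
cleaned = project_out(embedding, spurious_dir)
```

In this toy case, ablating head 2 removes its contribution entirely, and the projection then strips whatever remains along `spurious_dir` from the other heads, so `cleaned` is orthogonal to that direction.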

Primary Area: alignment, fairness, safety, privacy, and societal considerations

Submission Number: 5049
