A closer look at the explainability of Contrastive language-image pre-training

Published: 01 Jan 2025, Last Modified: 21 Jul 2025, Pattern Recognit. 2025, CC BY-SA 4.0
Abstract: Highlights
• We observe that CLIP exhibits opposite visualization and noisy activations.
• We find that inconsistent self-attention and redundant features cause these issues.
• CLIP Surgery is proposed for reliable CAM, comprising architecture surgery and feature surgery.
• Our method greatly improves the explainability of CLIP and is widely applicable.