Abstract: Highlights
• We observe that CLIP exhibits opposite visualization and noisy activations.
• We find that inconsistent self-attention and redundant features cause these issues.
• The CLIP Surgery is proposed for reliable CAM, with architecture and feature surgery.
• Our method greatly improves the explainability of CLIP with wide applicability.