More Context, Less Distraction: Improving Zero-Shot Inference of CLIP by Inferring and Describing Spurious Features
Keywords: CLIP, zero-shot classification, spurious feature, generative factor
Abstract: CLIP, as a foundational vision-language model, is widely used in zero-shot image classification due to its ability to understand various visual concepts and natural language descriptions. However, how to fully utilize CLIP's unprecedented human-like understanding capabilities to achieve better zero-shot classification is still an open question. This paper draws inspiration from the human visual perception process: a modern view holds that when classifying an image of an object, humans first infer its class-independent attributes (e.g., background, orientation, and illumination), and then classify the object based on this information. Similarly, we observe that providing CLIP with object attributes improves classification, and that CLIP itself can reasonably infer these attributes from an image. Based on these observations, we propose PerceptionCLIP, a training-free, two-step zero-shot inference method. Given an image, it first infers the object attributes, and then performs classification conditioned on them. Experiments show that PerceptionCLIP achieves better generalization, reduced reliance on spurious features, and better interpretability. For example, PerceptionCLIP improves average accuracy by 3.3\% and worst-group accuracy by 24.8\% on the Waterbirds dataset.
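The two-step inference described in the abstract (infer attributes, then classify conditioned on them) can be sketched as follows. This is a minimal illustration, not the paper's implementation: it uses mock unit-norm embeddings in place of real CLIP image/text encoders, and the class names, attribute list, and the choice to marginalize classification scores over the inferred attribute distribution are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative labels (the Waterbirds setting from the abstract);
# "background" is a class-independent attribute a human might infer first.
classes = ["landbird", "waterbird"]
attributes = ["on land", "on water"]

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

dim = 8
# Mock embeddings standing in for CLIP's encoders:
# attr_emb[a]          ~ text embedding of an attribute-only prompt
# class_attr_emb[c, a] ~ text embedding of a prompt naming class c with attribute a
attr_emb = normalize(rng.normal(size=(len(attributes), dim)))
class_attr_emb = normalize(rng.normal(size=(len(classes), len(attributes), dim)))
image_emb = normalize(rng.normal(size=dim))

temperature = 100.0  # CLIP-style logit scale (assumed value)

# Step 1: infer p(attribute | image) from image-text similarity.
p_attr = softmax(attr_emb @ image_emb * temperature)

# Step 2: classify conditioned on attributes, marginalizing over p(attribute | image):
# score(c) = sum_a p(a | image) * softmax_c( sim(image, prompt(c, a)) )
class_scores = np.zeros(len(classes))
for a, pa in enumerate(p_attr):
    class_scores += pa * softmax(class_attr_emb[:, a, :] @ image_emb * temperature)

prediction = classes[int(np.argmax(class_scores))]
print(prediction)
```

With real CLIP encoders, the mock embeddings would be replaced by encoded images and prompt texts; the conditioning structure stays the same.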
Submission Number: 59