ERICT: Enhancing Robustness by Identifying Concept Tokens in Zero-Shot Vision Language Models

Xinpeng Dong; Min Zhang; Didi Zhu; Ye Jun Jian; Zhang Keli; Aimin Zhou; Fei Wu; Kun Kuang

Published: 01 May 2025, Last Modified: 23 Jul 2025. Venue: ICML 2025 poster. License: CC BY 4.0

Abstract: Pre-trained vision-language models (VLMs) have revolutionized the field of machine learning, demonstrating exceptional performance across a wide range of tasks. However, they remain vulnerable to spurious correlations. Existing approaches typically fine-tune the model with labeled data or rely on large language models (LLMs) to generate more complex prompts. Although effective to some extent, these methods introduce new challenges: additional computational cost, and dependence on prompt quality without fully utilizing the vision modality. To address these limitations, we propose a novel method named ERICT to Enhance model Robustness by Identifying Concept Tokens. ERICT mitigates spurious correlations directly at the inference stage and comprises two key steps: (1) identify concept tokens that capture invariant features, using auxiliary prompts, and generate a token-level mask; (2) apply the mask to the attention weights of the CLS token in the vision encoder so that the model focuses on the relevant image region. Extensive experiments show that ERICT significantly improves overall performance, including worst-group performance, and achieves new state-of-the-art results.
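The two steps in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not say how concept tokens are scored or how the mask is applied, so the sketch assumes cosine similarity between patch tokens and a single auxiliary prompt embedding, a top-fraction selection rule (`keep_ratio`), and zeroing plus renormalising the CLS attention row. The function names `concept_token_mask` and `mask_cls_attention` and all toy values are hypothetical.

```python
import numpy as np


def concept_token_mask(patch_tokens, prompt_embedding, keep_ratio=0.5):
    """Return a boolean mask over patch tokens (True = keep).

    Assumption: tokens are scored by cosine similarity to one auxiliary
    prompt embedding, and the top `keep_ratio` fraction is kept.
    """
    tokens = patch_tokens / np.linalg.norm(patch_tokens, axis=1, keepdims=True)
    prompt = prompt_embedding / np.linalg.norm(prompt_embedding)
    scores = tokens @ prompt
    n_keep = max(1, int(round(keep_ratio * len(scores))))
    keep_idx = np.argsort(scores)[-n_keep:]
    mask = np.zeros(len(scores), dtype=bool)
    mask[keep_idx] = True
    return mask


def mask_cls_attention(cls_attention, mask):
    """Zero the CLS attention on masked-out patches and renormalise.

    `cls_attention` has length 1 + num_patches; index 0 is the CLS token's
    attention to itself, which this sketch always keeps (an assumption).
    """
    full_mask = np.concatenate(([True], mask))
    masked = np.where(full_mask, cls_attention, 0.0)
    return masked / masked.sum()


# Toy example: four 2-D patch tokens; the first two align with the prompt.
patch_tokens = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
prompt_embedding = np.array([1.0, 0.0])
mask = concept_token_mask(patch_tokens, prompt_embedding, keep_ratio=0.5)

# Uniform CLS attention over [CLS, patch0, patch1, patch2, patch3].
cls_attention = np.full(5, 0.2)
new_attention = mask_cls_attention(cls_attention, mask)
```

In this toy setup the two patches aligned with the prompt are kept, and the CLS attention mass is redistributed over the CLS token and those two patches. In a real vision encoder the same masking would have to be applied per head and per layer, a detail the abstract leaves unspecified.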

Lay Summary: Vision-language models (VLMs) often learn spurious correlations. For example, if cows always appear on grass during pre-training, the model may treat grass as part of the cow. We developed ERICT to enable VLMs to automatically "block out noise" when analyzing images. Just as humans listen selectively in a noisy environment, our method has two key steps: (1) helping VLMs identify the truly important features in the image; (2) dynamically adjusting the focus of VLMs' attention as they process the image, so that they are not distracted by irrelevant details. The method works like a smart filter installed on a VLM: it takes effect immediately at the final decision stage (inference), without additional training. Experiments show that ERICT significantly improves the robustness of VLMs in many scenarios.

Primary Area: Deep Learning->Robustness

Keywords: Robustness, Spurious correlation, VLM, Zero shot

Submission Number: 1324
