Text-Image Dual Consistency-Guided OOD Detection with Pretrained Vision-Language Models

20 Jan 2025 (modified: 18 Jun 2025) · Submitted to ICML 2025 · CC BY 4.0
Abstract: The advent of vision-language models (VLMs) such as CLIP has significantly advanced the development of zero-shot out-of-distribution (OOD) detection. Recent research has largely focused on enhancing the textual label space to improve OOD detection performance. However, these efforts often neglect the valuable information inherent in the image domain. As a result, visual feature similarities within in-distribution (ID) data remain underutilized, limiting the OOD detection capabilities of VLMs. To address this limitation, we propose a novel approach, DualCnst, based on text-image dual consistency. Our method evaluates test samples by jointly considering their semantic similarity to textual labels and their visual similarity to images synthesized from the textual label set by a text-to-image generative model. By integrating textual and visual information, this approach establishes a unified OOD scoring framework. Furthermore, it is fully compatible with existing methods, such as NegLabel, that focus on enriching the textual label space. Extensive experiments demonstrate that DualCnst achieves state-of-the-art performance across a range of OOD detection benchmarks while exhibiting robust generalization across diverse VLM architectures.
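The abstract describes the dual-consistency score only at a high level, so the sketch below is one plausible instantiation rather than the paper's actual formula. It assumes CLIP-style unit-normalized features, an MCM-style maximum-softmax score for each modality, and a convex combination of the two; the function name `dual_consistency_score`, the weight `alpha`, and the temperature `tau` are all hypothetical.

```python
import torch
import torch.nn.functional as F


def dual_consistency_score(img_feats: torch.Tensor,
                           text_feats: torch.Tensor,
                           synth_feats: torch.Tensor,
                           alpha: float = 0.5,
                           tau: float = 0.01) -> torch.Tensor:
    """Score test images by text and image consistency (illustrative sketch).

    img_feats:   (B, D) CLIP image features of test samples.
    text_feats:  (K, D) CLIP text features of the ID label set.
    synth_feats: (N, D) CLIP image features of images synthesized from
                 the ID labels by a text-to-image model.
    alpha, tau:  hypothetical mixing weight and softmax temperature.
    Returns a (B,) score; higher means more likely in-distribution.
    """
    # Unit-normalize so inner products are cosine similarities.
    img = F.normalize(img_feats, dim=-1)
    txt = F.normalize(text_feats, dim=-1)
    syn = F.normalize(synth_feats, dim=-1)

    # Text consistency: max softmax over label similarities (MCM-style).
    s_text = (img @ txt.T / tau).softmax(dim=-1).amax(dim=-1)

    # Image consistency: max softmax over synthesized-exemplar similarities.
    s_img = (img @ syn.T / tau).softmax(dim=-1).amax(dim=-1)

    # Convex combination of the two modality scores.
    return alpha * s_text + (1 - alpha) * s_img


if __name__ == "__main__":
    # Toy usage with random features standing in for CLIP embeddings.
    torch.manual_seed(0)
    scores = dual_consistency_score(torch.randn(4, 512),
                                    torch.randn(1000, 512),
                                    torch.randn(1000, 512))
    print(scores)  # threshold these scores to flag OOD samples
```

In this reading, a sample is declared OOD when neither the textual labels nor their synthesized visual exemplars are consistent with it, which is also why the score composes naturally with label-space-enrichment methods such as NegLabel: an enlarged label set simply expands `text_feats` and the pool used to generate `synth_feats`.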
Primary Area: Social Aspects->Robustness
Keywords: Out-of-distribution Detection
Submission Number: 3561