Fallback-Enabled Closed-Set Classification: Cross-Modal Consistency in Vision-Language Models

12 Feb 2026 (modified: 11 May 2026). Decision pending for TMLR. CC BY 4.0
Abstract: Vision-Language Models (VLMs) can describe and label images; however, this does not imply that they truly process what they perceive. Recent studies show that, despite their breadth of training, VLMs are surprisingly unreliable as classifiers in both closed-world and open-world settings. In this work, we explore a deeper question: can a VLM recognize when an image falls outside the set of categories it is asked to choose from? Our results reveal a surprising failure mode: even when the notion of in-set versus out-of-set is explicitly defined, VLMs often assign plausible in-set labels to out-of-set images, violating the task’s explicit constraint. Motivated by this, we propose a cross-modal consistency framework that reasons over both the visual and textual arms of the model and accepts an answer only when they agree. Experiments on three well-known datasets (DomainNet, VisDA, and iNaturalist-2021) demonstrate that this approach consistently improves balanced known vs. unknown detection over Source-Free Universal Domain Adaptation (SF-UniDA) baselines, showing that cross-modal consistency improves a VLM’s ability to follow the task logic and to recognize when an image falls outside the intended label space. Our results suggest that, with strong VLMs, fallback behavior need not rely exclusively on specialized SF-UniDA adaptation pipelines: a lightweight cross-modal consistency decision rule can be competitive with representative SF-UniDA baselines on standard benchmarks.
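The accept-only-on-agreement idea described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how each arm produces its prediction or how agreement is measured, so everything below is an assumption made for illustration: the function name `consistent_label`, the `"unknown"` fallback string, and the use of case-insensitive exact label matching as the agreement test are all hypothetical and are not the authors' implementation.

```python
from typing import Optional, Sequence

UNKNOWN = "unknown"  # hypothetical fallback label, not taken from the paper


def _normalize(label: Optional[str]) -> str:
    """Lowercase and trim a label so trivial formatting differences are ignored."""
    return (label or "").strip().lower()


def consistent_label(
    visual_pred: Optional[str],
    textual_pred: Optional[str],
    label_set: Sequence[str],
    fallback: str = UNKNOWN,
) -> str:
    """Return an in-set label only if both arms agree on it; otherwise fall back.

    visual_pred and textual_pred stand in for the outputs of the model's
    visual and textual arms (assumed here to be plain label strings).
    """
    allowed = {_normalize(c): c for c in label_set}
    v = _normalize(visual_pred)
    t = _normalize(textual_pred)
    # Accept only when both predictions are non-empty, identical, and in the closed set.
    if v and v == t and v in allowed:
        return allowed[v]
    return fallback


if __name__ == "__main__":
    classes = ["dog", "cat"]
    print(consistent_label("Dog", "dog ", classes))   # arms agree on an in-set label
    print(consistent_label("dog", "cat", classes))    # arms disagree
    print(consistent_label("zebra", "zebra", classes))  # agree, but out of set
```

In this sketch, disagreement between the arms and agreement on a label outside the closed set both trigger the fallback, which is one simple way to read "accepts an answer only when they agree".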
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Stephen_Lin1
Submission Number: 7476