Logic Channel Validation and Enhancement of Zero-Shot Vision-Language Comprehension on Vision Language Models

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Vision-Language Model, Visual-Language Comprehension, logic reasoning, zero-shot
Abstract: Frontier Large Vision-Language Models (LVLMs) exhibit remarkable capabilities on Visual-Language Comprehension (VLC) tasks, enabled by pretraining on vast visual-textual corpora. However, they are often deployed as zero-shot solutions in a black-box manner, since retraining remains challenging due to data privacy or model inaccessibility. Validating and understanding the behavior of these models therefore becomes important for generalization to new tasks. We propose a Logic Channel, operating in parallel with the black-box model channel, that performs explicit logical reasoning for validation and enhancement. The frontier LVLM, encapsulating latent vision-language knowledge, can be viewed as an Implicit Logic Channel. The proposed Explicit Logic Channel, mimicking human logical reasoning, incorporates a Large Language Model (LLM), a Visual Foundation Model (VFM), and a logical reasoning module that performs novel probabilistic inference for factual, counterfactual, relational, and causal-condition reasoning over the extracted and grounded visual-textual facts. Cross-channel logic consistency analysis enables model validation and selection, even without ground-truth annotations, and cross-channel integration further improves performance on zero-shot tasks over SOTA models. Our experiments on three recent challenging VLC benchmarks, NegBench, HC-RefCOCOg, and HC-RefLoCo, demonstrate the effectiveness of the proposed Logic Channel for logic-based model validation, selection, and improvement of LVLMs with enhanced explainability and trustworthiness.
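For intuition, below is a minimal sketch of what cross-channel consistency analysis and integration could look like; it is not the authors' implementation. It assumes each channel emits a per-query probability for the positive answer, and all function and variable names (consistency_rate, integrate, lvlm_scores, logic_scores) are hypothetical.

```python
# Minimal sketch (not the paper's code) of cross-channel consistency
# validation and integration, assuming each channel outputs a per-query
# probability for the positive answer. All names are hypothetical.
from typing import List

def consistency_rate(implicit_probs: List[float],
                     explicit_probs: List[float],
                     threshold: float = 0.5) -> float:
    """Fraction of queries on which the Implicit Logic Channel (black-box
    LVLM) and the Explicit Logic Channel agree after binarizing at the
    threshold. A higher rate can serve as an annotation-free proxy score
    for model validation and selection."""
    agree = sum(
        (p >= threshold) == (q >= threshold)
        for p, q in zip(implicit_probs, explicit_probs)
    )
    return agree / len(implicit_probs)

def integrate(implicit_p: float, explicit_p: float,
              alpha: float = 0.5) -> float:
    """Simple confidence-weighted fusion of the two channels; alpha trades
    off the black-box LVLM against the explicit reasoning channel."""
    return alpha * implicit_p + (1.0 - alpha) * explicit_p

# Usage: rank candidate LVLMs by agreement with the explicit channel,
# then fuse the selected model's scores for the final zero-shot decision.
lvlm_scores = [0.9, 0.2, 0.7]    # hypothetical Implicit Channel outputs
logic_scores = [0.8, 0.3, 0.4]   # hypothetical Explicit Channel outputs
print(consistency_rate(lvlm_scores, logic_scores))  # 0.666...
print([integrate(p, q) for p, q in zip(lvlm_scores, logic_scores)])
```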
Supplementary Material: pdf
Primary Area: interpretability and explainable AI
Submission Number: 11168