Abstract: Vision-language models (VLMs) can effectively act as visual assistants, interpreting questions about images and producing human-like responses. This work explores their ability to demonstrate human-like reasoning. To address concerns about the consistency of VLMs' reasoning, we introduce a chain-of-thought (CoT) consistency measure. To reduce the cost of extensive human annotation, we propose an LLM-Human-in-the-Loop pipeline. Based on this pipeline, we build the \textbf{\benchmark} benchmark to measure both the zero-shot reasoning performance and the consistency of VLMs. We evaluate state-of-the-art VLMs and find that even the best-performing model fails to demonstrate strong visual reasoning capability and consistency, indicating that substantial effort is required before VLMs can perform visual reasoning as systematically and consistently as humans. As an early step, we propose a two-stage training framework aimed at improving both the reasoning performance and the consistency of VLMs. The framework comprises two primary stages, supervised fine-tuning and learning from feedback, which guide VLMs toward generating reasoning chains that are both consistent and grounded. Our framework yields a 4\% relative improvement in reasoning performance and consistency.
Paper Type: long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: English
Consent To Share Submission Details: On behalf of all authors, we agree to the terms above to share our submission details.