Abstract: Vision-language models (VLMs) can effectively act as visual assistants, interpreting questions about images and producing human-like
responses. This work explores their ability to demonstrate human-like reasoning. To address
concerns about the consistency of VLMs’ reasoning, we introduce a chain-of-thought (CoT)
consistency measure. We tackle the challenge
of extensive human annotations by proposing
an LLM-Human-in-the-Loop pipeline. Based
on this pipeline, we build the CURE benchmark to measure both the zero-shot reasoning performance and consistency of VLMs.
We evaluate state-of-the-art VLMs and find
that even the best-performing model is unable
to demonstrate strong visual reasoning capabilities and consistency, indicating that substantial efforts are required to enable VLMs
to perform visual reasoning as systematically
and consistently as humans. As an early step,
we propose a two-stage training framework
aimed at improving both the reasoning performance and consistency of VLMs without
human annotations. The framework comprises two primary stages, supervised fine-tuning and learning from feedback, which guide VLMs to generate reasoning chains that are both consistent and grounded. Our framework achieves a 4% relative improvement in reasoning performance and consistency. We
release the dataset at https://github.com/Yangyi-Chen/CoTConsistency.