Dissecting Zero-Shot Visual Reasoning Capabilities in Vision and Language Models

19 Sept 2023 (modified: 25 Mar 2024) · ICLR 2024 Conference Withdrawn Submission
Keywords: Visual Reasoning, Multimodal Models, Large Language Models, Chain-of-Thought Reasoning, LLM Grounding, Sensory Grounding, Synthetic Datasets, Benchmarking, Automated Prompt Generation
TL;DR: This study benchmarks chain-of-thought prompting techniques for multimodal models in zero-shot settings using synthetic datasets, and compares the performance of LLMs to that of their multimodal counterparts to study the impact of visual grounding mechanisms.
Abstract: Vision-language models (VLMs) have shown impressive zero- and few-shot performance on real-world visual question answering (VQA) benchmarks such as VQAv2 and OK-VQA, suggesting their capabilities as visual reasoning engines. However, these benchmarks contain questions that involve only a limited number of reasoning steps, and they conflate visual reasoning with world knowledge. Consequently, it remains unclear whether a VLM’s apparent visual reasoning performance is due to its world knowledge or to actual visual reasoning capabilities. Hence, we systematically examine and benchmark the zero-shot visual reasoning capabilities of VLMs through synthetic datasets that require minimal world knowledge and allow analysis over a broad range of reasoning steps. We focus on zero-shot reasoning rather than few-shot or fine-tuned approaches to gain clearer insight into the models’ inherent capabilities, without the additional influence of task-specific training data. We design novel scene-informed prompting techniques and study two aspects of zero-shot visual reasoning: i) evaluating the impact of conveying visual scene information to the underlying large language model (LLM) of the VLM as either visual embeddings or purely textual scene descriptions, and ii) comparing the effectiveness of chain-of-thought (CoT) prompting to standard prompting for zero-shot visual reasoning. Notably, we find that: i) the underlying LLMs of VLMs, when provided only ground-truth textual scene descriptions, consistently perform better than when provided visual embeddings, achieving ∼18% higher accuracy on the PTR dataset, and ii) CoT prompting performs marginally better than standard prompting only for the comparatively large GPT-3.5-Turbo (175B) model, and performs worse for smaller-scale models. This suggests the emergence of CoT abilities for visual reasoning in LLMs at larger scales, even when world knowledge is limited. Overall, we find limitations in the reasoning capabilities of VLMs and LLMs on more complex visual reasoning tasks, and highlight the important role that LLMs can play in visual reasoning.
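For concreteness, the sketch below illustrates the two prompting conditions contrasted in the abstract: standard prompting versus zero-shot chain-of-thought (CoT) prompting over a purely textual, ground-truth scene description. This is a minimal illustration under assumed placeholders (the scene/question text and the build_prompt helper are hypothetical), not the paper's actual pipeline.

```python
# Illustrative sketch (not the authors' code): standard vs. zero-shot CoT prompting
# when the visual scene is conveyed to an LLM as a textual scene description.
# All dataset fields and wording below are hypothetical placeholders.

def build_prompt(scene_description: str, question: str, use_cot: bool) -> str:
    """Compose a zero-shot prompt from a textual scene description and a question."""
    prompt = (
        "Scene description:\n"
        f"{scene_description}\n\n"
        f"Question: {question}\n"
    )
    if use_cot:
        # Zero-shot CoT trigger in the style of "Let's think step by step."
        prompt += "Answer: Let's think step by step."
    else:
        prompt += "Answer:"
    return prompt


if __name__ == "__main__":
    # Hypothetical example loosely in the style of synthetic benchmarks such as PTR.
    scene = ("There is a small red cube to the left of a large blue sphere. "
             "A green cylinder is behind the cube.")
    question = "How many objects are to the left of the blue sphere?"
    for use_cot in (False, True):
        print(f"--- use_cot={use_cot} ---")
        print(build_prompt(scene, question, use_cot))
```

In the paper's comparison, the same questions could instead be posed to the full VLM with visual embeddings of the image; the reported ∼18% gap on PTR refers to the textual-description condition outperforming the embedding condition.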
Supplementary Material: zip
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 1914