Dissecting Zero-Shot Visual Reasoning Capabilities in Vision and Language Models

Published: 19 Mar 2024, Last Modified: 08 May 2024 · Tiny Papers @ ICLR 2024 (Notable) · CC BY 4.0
Keywords: Visual Reasoning, Multimodal Models, Large Language Models, Chain-of-Thought Reasoning, LLM Grounding, Sensory Grounding, Synthetic Datasets, Benchmarking, Automated Prompt Generation
TL;DR: This study systematically evaluates zero-shot visual reasoning in VLMs and LLMs, finding that LLMs perform better when given textual scene descriptions than when given visual embeddings, and highlighting limitations of current VLMs for compositional visual reasoning.
Abstract: Vision-language models (VLMs) have shown impressive zero- and few-shot performance on real-world visual question answering (VQA) benchmarks, suggesting their potential as visual reasoning engines. However, existing works typically use benchmarks that conflate “pure” visual reasoning with world knowledge, and whose questions involve only a limited number of reasoning steps. Thus, it remains unclear whether a VLM’s apparent visual reasoning performance stems from its world knowledge or from actual visual reasoning capabilities. To resolve this ambiguity, we systematically benchmark and dissect the zero-shot visual reasoning capabilities of VLMs using synthetic datasets that require minimal world knowledge and allow analysis over a broad range of reasoning steps. We specifically evaluate the impact of conveying scene information to the VLM’s underlying large language model (LLM) either as visual embeddings or as purely textual scene descriptions. We notably find that the underlying LLMs, when provided textual scene descriptions, consistently perform significantly better than when provided visual embeddings. Our work comprehensively identifies limitations of VLMs for compositional visual reasoning, and highlights the important role that LLMs can play in scene understanding and visual reasoning.
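The following is a minimal sketch of the two evaluation conditions described in the abstract. The helper names (`vlm_answer`, `llm_answer`, the `Example` fields) are illustrative placeholders, not the paper's actual code; the real models, prompts, and datasets are described in the paper and supplementary material.

```python
# Sketch of the comparison: scene given as visual embeddings (via the VLM)
# vs. as a ground-truth textual scene description (direct to the LLM).
# All functions below are hypothetical stubs for illustration only.
from dataclasses import dataclass


@dataclass
class Example:
    image_path: str          # rendered synthetic scene
    scene_description: str   # ground-truth textual description of the same scene
    question: str            # reasoning question over the scene
    answer: str              # expected answer


def vlm_answer(image_path: str, question: str) -> str:
    """Condition 1: the LLM receives the scene as visual embeddings
    produced by the VLM's vision encoder. (Hypothetical stub.)"""
    raise NotImplementedError


def llm_answer(scene_description: str, question: str) -> str:
    """Condition 2: the same underlying LLM receives a purely textual
    scene description instead of visual embeddings. (Hypothetical stub.)"""
    raise NotImplementedError


def accuracy(examples, predict) -> float:
    """Fraction of examples the given prediction function answers correctly."""
    return sum(predict(ex) == ex.answer for ex in examples) / len(examples)


def evaluate(examples):
    """Compare zero-shot accuracy under the two scene-input conditions."""
    return {
        "visual_embeddings": accuracy(examples, lambda ex: vlm_answer(ex.image_path, ex.question)),
        "textual_description": accuracy(examples, lambda ex: llm_answer(ex.scene_description, ex.question)),
    }
```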
Supplementary Material: zip
Submission Number: 84