Can Vision Language Models Learn from Visual Demonstrations of Ambiguous Spatial Reasoning?

Published: 10 Oct 2024, Last Modified: 19 Nov 2024 · AFM 2024 Poster · CC BY 4.0
Keywords: vision language model, spatial reasoning, in-context learning, curriculum learning
TL;DR: We propose a synthetic benchmark named SVAT that broadly challenges state-of-the-art VLMs to perform ambiguous spatial reasoning tasks from visual demonstrations.
Abstract: Large vision-language models (VLMs) have become state-of-the-art for many computer vision tasks, with in-context learning (ICL) as a popular adaptation strategy for new ones. But can VLMs learn novel concepts from visual demonstrations when the text queries are ambiguous, or are they limited to adapting to the output format of ICL examples? We propose a new benchmark, Spatial Visual Ambiguity Tasks (SVAT), that challenges state-of-the-art VLMs to learn new visuospatial tasks in-context. We find that VLMs fail at this zero-shot, and sometimes continue to fail after finetuning. However, adding simpler data to the training set via curriculum learning improves ICL performance. We release our benchmark generation, training, and evaluation code to facilitate future research.
Submission Number: 30