TL;DR: we define a framework for measuring simple-to-hard generalization on text/visual inputs for VLMs and propose strategies to mitigate modality imbalance
Abstract: Vision Language Models (VLMs) are impressive at visual question answering and image captioning. But they underperform on multi-step visual reasoning---even compared to LLMs on the same tasks presented in text form---giving rise to perceptions of *modality imbalance* or *brittleness*. Towards a systematic study of such issues, we introduce a synthetic framework for assessing the ability of VLMs to perform algorithmic visual reasoning, comprising three tasks: Table Readout, Grid Navigation, and Visual Analogy. Each has two levels of difficulty, SIMPLE and HARD, and even the SIMPLE versions are difficult for frontier VLMs. We propose strategies for training on the SIMPLE version of tasks that improve performance on the corresponding HARD task, i.e., simple-to-hard (S2H) generalization. This controlled setup, where each task also has an equivalent text-only version, allows a quantification of the modality imbalance and of how it is impacted by training strategy. We show that 1) explicit image-to-text conversion is important in promoting S2H generalization on images, by transferring reasoning from text; and 2) this conversion can be internalized at test time. We also report results of a mechanistic study of this phenomenon, identifying measures of gradient alignment that can predict which training strategies promote better S2H generalization. Ablations highlight the importance of chain-of-thought.
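For concreteness, the 3 x 2 x 2 evaluation grid implied by the abstract (three tasks, each with SIMPLE/HARD difficulty and text/image versions) could be enumerated as in the illustrative sketch below; the class names `Task`, `Difficulty`, `Modality`, and `EvalSetting` are hypothetical placeholders, not identifiers from the released code.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product

class Task(Enum):
    TABLE_READOUT = "table_readout"
    GRID_NAVIGATION = "grid_navigation"
    VISUAL_ANALOGY = "visual_analogy"

class Difficulty(Enum):
    SIMPLE = "simple"
    HARD = "hard"

class Modality(Enum):
    TEXT = "text"
    IMAGE = "image"

@dataclass(frozen=True)
class EvalSetting:
    task: Task
    difficulty: Difficulty
    modality: Modality

# Enumerate all 3 x 2 x 2 = 12 settings: every task has an equivalent
# text-only and image-only version at both difficulty levels.
EVAL_GRID = [EvalSetting(t, d, m) for t, d, m in product(Task, Difficulty, Modality)]
```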
Lay Summary: A family of models called Vision Language Models (VLMs) has been developed to solve tasks involving both text and image inputs. However, recent studies show signs of *modality imbalance*, where trained models reason better when the same task is presented as text rather than as an image.
We design a framework to quantify and mitigate this *modality imbalance*. We propose three algorithmic reasoning tasks, where each task 1) has two levels of difficulty (SIMPLE and HARD); and 2) has an equivalent pair of text-only and image-only versions. We propose different strategies for training on the SIMPLE version of each task and evaluate them on the corresponding HARD version. By comparing simple-to-hard (S2H) generalization---i.e., the accuracy on the HARD version---we can quantify modality imbalance and assess how it is impacted by the training strategy.
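As a minimal sketch of this protocol (the function names `model_factory`, `train_simple`, and `eval_hard` are hypothetical placeholders, not the released code), modality imbalance can be quantified as the gap in HARD-version accuracy between a model trained on the text-only SIMPLE data and one trained on the image-only SIMPLE data:

```python
def s2h_generalization(model_factory, task, modality, train_simple, eval_hard):
    """Train a fresh model on the SIMPLE split of `task` in the given modality,
    then return its accuracy on the corresponding HARD split (same modality)."""
    model = model_factory()
    train_simple(model, task=task, modality=modality)      # SIMPLE-only training
    return eval_hard(model, task=task, modality=modality)  # HARD-only evaluation

def modality_imbalance(model_factory, task, train_simple, eval_hard):
    """Gap between text and image S2H generalization on one task:
    a large positive gap is a concrete sign of modality imbalance."""
    acc_text = s2h_generalization(model_factory, task, "text", train_simple, eval_hard)
    acc_image = s2h_generalization(model_factory, task, "image", train_simple, eval_hard)
    return acc_text - acc_image
```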
We first show that training on the image-only version of a task yields poorer reasoning capability than training on the text-only version, a concrete manifestation of *modality imbalance*. Next, we show that we can promote the model's reasoning capability on images by training it to explicitly convert a provided image to its equivalent text representation before solving the task---here, the model transfers its text reasoning to image inputs (see the sketch below). Furthermore, this conversion step can later be internalized, meaning that once training is complete, the model can skip the conversion and solve the task directly.
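A minimal sketch of how the supervision for this conversion strategy could be assembled is shown below; the field names `image`, `text_representation`, and `cot_solution` are assumptions for illustration, and the actual data format is defined in the released code.

```python
def conversion_then_solve_target(example):
    """Supervision for the explicit image-to-text conversion strategy: given the
    image as input, the model is trained to first reproduce the equivalent text
    representation, then produce the chain-of-thought solution."""
    return {
        "input_image": example["image"],
        "target": (
            "Text representation:\n" + example["text_representation"]
            + "\n\nSolution:\n" + example["cot_solution"]
        ),
    }
```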
We also report results of a mechanistic study of this phenomenon, identifying measures of gradient alignment that help predict which training strategies promote better S2H generalization. Our ablation studies highlight the importance of chain-of-thought in transferring reasoning capabilities across modalities.
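The summary above does not spell out the exact alignment measure; one standard way to operationalize the idea (a sketch only, using PyTorch, and the specific quantity used in the paper may differ) is the cosine similarity between the loss gradient on a batch of SIMPLE training examples and the loss gradient on a batch of HARD examples: higher alignment suggests that SIMPLE-only training moves the parameters in a direction that also helps on HARD inputs.

```python
import torch

def avg_flat_grad(model, loss_fn, batch):
    """Loss gradient over a batch, flattened into a single vector."""
    model.zero_grad()
    loss_fn(model, batch).backward()
    return torch.cat([
        p.grad.reshape(-1) for p in model.parameters() if p.grad is not None
    ])

def gradient_alignment(model, loss_fn, simple_batch, hard_batch):
    """Cosine similarity between SIMPLE and HARD gradient directions."""
    g_simple = avg_flat_grad(model, loss_fn, simple_batch)
    g_hard = avg_flat_grad(model, loss_fn, hard_batch)
    return torch.nn.functional.cosine_similarity(g_simple, g_hard, dim=0).item()
```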
Link To Code: https://github.com/princeton-pli/VLM_S2H
Primary Area: Deep Learning->Large Language Models
Keywords: vision language models, modality imbalance, SIMPLE-to-HARD generalization, gradient alignment
Submission Number: 2277