In- and Out-of-Distribution Generalization of Reasoning in Multimodal LLMs for Simple Visual Planning Tasks
Keywords: reasoning LLMs, out-of-distribution generalization
Abstract: Integrating reasoning into (multimodal) large language models has recently led to significant improvements in their capabilities.
However, generalization in reasoning models is still vaguely defined and poorly understood.
In this work, we present an evaluation framework to rigorously examine how well chain-of-thought (CoT) approaches generalize in simple visual planning, specifically on a grid-based navigation task.
The versatility of the task and its data allows us to fine-tune model variants using different input representations (visual and textual) and CoT reasoning strategies, and systematically evaluate them under both in-distribution (ID) and out-of-distribution (OOD) test conditions.
Our experiments show that OOD generalization (e.g., to larger maps) is strongly affected by the format used for input maps and CoT chains.
Surprisingly, we find that reasoning traces which combine multiple text formats yield the best OOD generalization.
Moreover, CoT that reproduces the steps of the A* algorithm yields state-of-the-art ID accuracy, and a simple augmentation of the map solutions seen during training greatly boosts OOD results.
Finally, purely text-based models consistently outperform those utilizing image-based inputs, including a recently proposed approach relying on latent space reasoning.
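To make the idea of a CoT that reproduces A* steps concrete, here is a minimal sketch of A* on a small grid that logs each node expansion as a line of text. It is illustrative only and not taken from the paper: the grid encoding (`#` for walls), the 4-connected moves, the Manhattan heuristic, the trace format, and the function name `astar_with_trace` are all assumptions.

```python
import heapq


def astar_with_trace(grid, start, goal):
    """Run A* on a 4-connected grid and return (path, trace).

    grid: list of equal-length strings; '#' marks a wall (assumed encoding).
    start, goal: (row, col) tuples.
    trace: list of text lines, one per expanded node (hypothetical format).
    """
    rows, cols = len(grid), len(grid[0])

    def h(cell):
        # Manhattan distance: admissible and consistent for unit-cost moves.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(h(start), 0, start)]  # entries are (f, g, cell)
    g_cost = {start: 0}
    parent = {start: None}
    closed = set()
    trace = []

    while open_heap:
        f, g, cell = heapq.heappop(open_heap)
        if cell in closed:
            continue  # stale heap entry for an already expanded cell
        closed.add(cell)
        trace.append(f"expand {cell} g={g} f={f}")

        if cell == goal:
            # Walk parent links back to the start to recover the path.
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1], trace

        r, c = cell
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nxt = (r + dr, c + dc)
            if not (0 <= nxt[0] < rows and 0 <= nxt[1] < cols):
                continue
            if grid[nxt[0]][nxt[1]] == "#":
                continue
            new_g = g + 1
            if new_g < g_cost.get(nxt, float("inf")):
                g_cost[nxt] = new_g
                parent[nxt] = cell
                heapq.heappush(open_heap, (new_g + h(nxt), new_g, nxt))

    return None, trace  # goal unreachable


if __name__ == "__main__":
    demo_grid = ["..#", "...", "#.."]
    demo_path, demo_trace = astar_with_trace(demo_grid, (0, 0), (2, 2))
    print("\n".join(demo_trace))
    print("path:", demo_path)
```

In a setup like the one described, the lines in `trace` followed by the final path could serve as a textual reasoning chain for fine-tuning. The actual trace formats used in the paper may differ.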
Submission Number: 55