Abstract: Chain-of-Thought (CoT) prompting has proven highly effective for eliciting complex reasoning in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs), yet it struggles on complex spatial reasoning tasks. Human cognition, in contrast, extends beyond language alone, enabling the remarkable ability to think in both words and images. Inspired by this mechanism, we propose a new reasoning paradigm, Multimodal Visualization-of-Thought (MVoT), which enables visual thinking in MLLMs by generating image visualizations of their reasoning traces. To ensure high-quality visualizations, we introduce a token discrepancy loss for autoregressive MLLMs, which significantly improves both visual coherence and fidelity. We validate the approach on several dynamic spatial reasoning tasks. Experimental results show that MVoT achieves competitive performance across tasks and delivers robust, reliable improvements in the most challenging scenarios where CoT fails. MVoT thus opens new possibilities for complex reasoning tasks in which visual thinking can effectively complement verbal reasoning.
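The abstract names a token discrepancy loss but does not define it here. Below is a minimal, hypothetical sketch of one plausible form, assuming discrete visual tokens drawn from a codebook: each candidate token's predicted probability is weighted by the embedding-space MSE between its codebook vector and the ground-truth token's vector, so tokens that are probable but visually distant are penalized more. The function name, tensor shapes, and weighting scheme are illustrative assumptions, not the paper's verified formulation.

```python
import torch
import torch.nn.functional as F

def token_discrepancy_loss(logits, target_ids, codebook):
    """Hypothetical token discrepancy loss (illustrative assumption).

    Weights each candidate visual token's predicted probability by the
    MSE distance between its codebook embedding and the ground-truth
    token's embedding.

    logits:     (batch, vocab) raw scores over visual tokens
    target_ids: (batch,) ground-truth visual token indices
    codebook:   (vocab, dim) visual token embeddings
    """
    probs = F.softmax(logits, dim=-1)                         # (B, V)
    target_emb = codebook[target_ids]                         # (B, D)
    # MSE distance from every codebook entry to the target embedding
    dist = ((codebook.unsqueeze(0) - target_emb.unsqueeze(1)) ** 2).mean(-1)  # (B, V)
    # Expected embedding-space distance under the predicted distribution
    return (probs * dist).sum(-1).mean()

# Toy usage with random tensors
B, V, D = 2, 16, 8
loss = token_discrepancy_loss(
    torch.randn(B, V), torch.randint(0, V, (B,)), torch.randn(V, D)
)
```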
Lay Summary: AI models like ChatGPT are good at solving problems by thinking step-by-step in words. But they often struggle with tasks that involve understanding space, like imagining how objects move or fit together. Humans, on the other hand, use both language and mental pictures to reason through such challenges.
We created a new method called Multimodal Visualization-of-Thought (MVoT) that allows AI to “draw out” its thoughts as it reasons. Just like a person might sketch a diagram to solve a puzzle, our method lets the model generate helpful images to support its thinking. We also trained it to make these images clearer and more accurate.
This helps AI handle more complex problems, especially those that require visual thinking. MVoT brings us closer to building smarter, more human-like AI that can think in both words and pictures.
Primary Area: Deep Learning->Large Language Models
Keywords: Multimodal Large Language Models, Spatial Reasoning
Submission Number: 10052