Keywords: visual chain of thought, interleaved text and image generation, multimodal reasoning
Abstract: Humans often rely on visual aids, such as diagrams or sketches, when tackling complex problems. Teaching multimodal models to adopt similar strategies, a process known as Visual Chain of Thought (visual CoT), remains far more difficult. The main challenges are: (1) the weak performance of off-the-shelf visual CoT, which hinders reinforcement learning, and (2) the lack of high-quality visual CoT training data. We introduce **Zebra-CoT**, a diverse, large-scale dataset of interleaved text-image reasoning traces, comprising 182,384 traces across 18 domains and more than 50 distinct tasks.
This dataset is specifically designed to train models to natively perform visual CoT.
We emphasize four categories of tasks where sketching or visual reasoning is especially natural, spanning (a) *scientific questions* such as geometry, physics, and algorithms;
(b) *2D visual reasoning tasks* like visual search and jigsaw puzzles;
(c) *3D reasoning tasks* including 3D multi-hop inference, embodied planning, and robot planning;
and (d) *visual logic problems and strategic games* like chess.
Fine-tuning the Anole‑7B model on Zebra-CoT yields a +12\% improvement in accuracy on our test set and gains of up to +13\% on standard VLM benchmarks.
Similarly, fine-tuning Bagel‑7B produces models capable of generating high-quality interleaved visual reasoning chains, underscoring Zebra-CoT's effectiveness in advancing multimodal reasoning.
Primary Area: datasets and benchmarks
Submission Number: 21509