ImgCoT: Compressing Long Chain of Thought into Compact Visual Tokens for Efficient Reasoning of Large Language Model

Published: 30 Apr 2026, Last Modified: 24 Jun 2026, ICML 2026 regular, CC BY 4.0
Abstract: Compressing long chains of thought (CoT) into compact latent tokens is crucial for efficient reasoning with large language models (LLMs). Recent studies employ autoencoders to achieve this by reconstructing textual CoT from latent tokens, thus encoding CoT semantics. However, treating textual CoT as the reconstruction target forces latent tokens to preserve surface-level linguistic features (e.g., word choice and syntax), introducing a strong linguistic inductive bias that prioritizes linguistic form over reasoning structure and limits logical abstraction. We therefore propose ImgCoT, which changes the reconstruction target from textual CoT to visual CoT obtained by rendering the CoT as images. This replaces the linguistic bias with a spatial inductive bias, i.e., a tendency to model the spatial layout of the reasoning steps in visual CoT, enabling latent tokens to better capture global reasoning structure. However, although visual latent tokens encode abstract reasoning structure, they may blur reasoning details. We thus propose loose ImgCoT, a hybrid reasoning scheme that augments visual latent tokens with a few key textual reasoning steps, selected based on low token log-likelihood. This design allows LLMs to retain both global reasoning structure and fine-grained reasoning details with fewer tokens than the complete CoT. Extensive experiments across multiple datasets and LLMs demonstrate the effectiveness of both versions of ImgCoT.
Lay Summary: Reasoning language models often solve hard problems by generating long step-by-step explanations. These “chains of thought” improve accuracy, but they also make models slow and expensive because every reasoning step must be stored and processed. Recent work tries to compress these reasoning traces into compact internal representations, but most methods still force models to reconstruct the original text. This means the compressed representation spends too much effort remembering writing style and wording instead of the actual reasoning process. We introduce ImgCoT, a new way to compress reasoning by turning chains of thought into images. Instead of reconstructing text, our model reconstructs visual layouts of the reasoning steps. This changes the model’s focus from language details to the overall structure of reasoning, helping it capture logical patterns more effectively. We also propose loose ImgCoT, which combines the compact visual representation with a few important text steps that the model is least certain about. This preserves critical details while still using far fewer tokens than the full explanation. Experiments across multiple datasets and language models show that our approach maintains strong reasoning performance with much shorter reasoning traces. Our work suggests that visual representations could become a powerful new tool for building faster and more efficient reasoning systems.
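The key-step selection in loose ImgCoT can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `select_key_steps`, the use of the mean token log-likelihood per step as the score, and the fixed budget `k` are all assumptions made for illustration, and the listing only states that steps are selected based on low token log-likelihood.

```python
from typing import List, Sequence


def select_key_steps(
    steps: Sequence[str],
    token_logprobs: Sequence[Sequence[float]],
    k: int = 2,
) -> List[str]:
    """Return the k steps with the lowest mean token log-likelihood, in original order.

    Hypothetical helper: the scoring rule (mean per step) and the budget k are
    assumptions, not details taken from the paper.
    """
    if len(steps) != len(token_logprobs):
        raise ValueError("steps and token_logprobs must have the same length")
    if any(len(lp) == 0 for lp in token_logprobs):
        raise ValueError("every step needs at least one token log-probability")
    # Score each step by the average log-likelihood of its tokens.
    scores = [sum(lp) / len(lp) for lp in token_logprobs]
    # Indices of the k steps the model is least certain about.
    lowest = sorted(range(len(steps)), key=lambda i: scores[i])[:k]
    # Restore the original reasoning order before returning.
    return [steps[i] for i in sorted(lowest)]


# Toy reasoning trace with made-up per-token log-probabilities.
steps = ["Let x be the number of apples.", "Then 3x + 2 = 14.", "So x = 4."]
logprobs = [[-0.1, -0.2, -0.1], [-1.5, -2.0, -0.9], [-0.3, -0.4]]
selected = select_key_steps(steps, logprobs, k=1)
```

In this toy trace the second step has the lowest average log-likelihood, so it is the one kept as text alongside the visual latent tokens.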
Primary Area: Deep Learning->Large Language Models
Keywords: Chain-of-Thought, Latent reasoning, Efficient Reasoning
Submission Number: 2886