Keywords: Autoregressive multimodal reasoning, Layer- and hop-aware retention, Green AI
TL;DR: We introduce DARE, a novel framework for efficient multimodal reasoning that leverages visual-to-text information flow and asymmetric routing, cutting FLOPs and KV-cache by over 40% without loss of accuracy.
Abstract: Recently, visualization-of-thought (VoT) has unlocked new opportunities for complex spatial reasoning in multimodal large language models (MLLMs) by complementing verbal reasoning with visual thinking.
However, the autoregressive accumulation of long, redundant token sequences substantially increases computation and memory costs.
In this paper, we present a new efficient framework for multimodal spatial reasoning, named *DARE*, designed to adaptively prune multimodal tokens across different network depths, reasoning hops, and modalities.
First, *DARE* devises an intra- and inter-hop-aware differentiable retention mechanism to dynamically estimate token importance both within each reasoning step and across successive hops.
Recognizing that deeper network layers encode visual cues into the verbal stream, *DARE* introduces an asymmetric compression strategy that prunes tokens according to modality-specific redundancy and semantic importance.
Furthermore, *DARE* incorporates a progressive KV-cache retention policy aligned with cross-modal fusion dynamics, further reducing memory overhead during autoregressive reasoning.
Our method delivers substantial reductions in computation and memory footprint, averaging a 40.37\% reduction in FLOPs and a 46.07\% reduction in KV-cache usage,
while consistently preserving or even improving reasoning performance across seven multimodal spatial reasoning benchmarks and further generalizing to broader multimodal reasoning tasks,
establishing a scalable and robust recipe for efficient multimodal reasoning.
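The abstract does not give implementation details, so the sketch below is only a rough illustration of the kind of machinery described: attention-derived token-importance scoring, modality-asymmetric top-k pruning, and a depth-dependent KV-cache budget. All names (`retention_scores`, `asymmetric_prune`, `progressive_kv_budget`), keep ratios, and the linear schedule are hypothetical assumptions for exposition, not DARE's actual algorithm.

```python
# Hypothetical illustration only: function names, keep ratios, and the scoring
# rule are assumptions for exposition, not DARE's actual design.
import torch

def retention_scores(attn, temperature=1.0):
    """Differentiable token-importance scores from one layer's attention maps.

    attn: (batch, heads, q_len, kv_len) attention weights.
    Returns (batch, kv_len) soft retention scores.
    """
    saliency = attn.mean(dim=1).mean(dim=1)           # average over heads and queries
    return torch.softmax(saliency / temperature, -1)  # soft, differentiable weighting

def asymmetric_prune(scores, is_visual, keep_visual=0.3, keep_text=0.7):
    """Keep different fractions of visual vs. text tokens (asymmetric compression).

    scores:    (batch, seq) retention scores
    is_visual: (seq,) boolean mask marking visual tokens
    Returns a (batch, seq) boolean keep mask.
    """
    keep_masks = []
    for b in range(scores.size(0)):
        keep = torch.zeros_like(is_visual)
        for modality_mask, ratio in ((is_visual, keep_visual), (~is_visual, keep_text)):
            idx = modality_mask.nonzero(as_tuple=True)[0]
            if idx.numel() == 0:
                continue
            k = max(1, int(ratio * idx.numel()))
            top = scores[b, idx].topk(k).indices
            keep[idx[top]] = True
        keep_masks.append(keep)
    return torch.stack(keep_masks)

def progressive_kv_budget(layer, num_layers, base_keep=0.9, final_keep=0.4):
    """Linearly tighten the KV-cache keep ratio with depth (a stand-in schedule)."""
    frac = layer / max(1, num_layers - 1)
    return base_keep + frac * (final_keep - base_keep)
```

In such a scheme, each decoding step would score cached tokens per layer, drop the lowest-scoring ones under the layer's budget, and apply the looser text ratio to the verbal stream once visual cues have been absorbed at depth; the actual retention mechanism, hop-awareness, and fusion-aligned policy are described in the paper itself.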
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 14462