Keywords: Autoregressive multimodal reasoning, Layer- and hop-aware retention, Green AI
TL;DR: We introduce DARE, a novel framework for efficient multimodal reasoning that leverages visual-to-text information flow and asymmetric routing, cutting FLOPs and KV-cache by over 40% without loss of accuracy.
Abstract: Recently, visualization-of-thought (VoT) has unlocked new opportunities for complex spatial reasoning in multimodal large language models (MLLMs) by complementing verbal reasoning with visual thinking.
However, the autoregressive accumulation of long, redundant token sequences substantially increases computation and memory costs.
In this paper, we present a new efficient framework for multimodal spatial reasoning, named *DARE*, designed to adaptively prune multimodal tokens across different network depths, reasoning hops, and modalities.
First, *DARE* devises an intra- and inter-hop-aware differentiable retention mechanism to dynamically estimate token importance both within each reasoning step and across successive hops.
Recognizing that deeper network layers encode visual cues into the verbal stream, *DARE* introduces an asymmetric compression strategy that prunes tokens according to modality-specific redundancy and semantic importance.
Furthermore, *DARE* incorporates a progressive KV-cache retention policy aligned with cross-modal fusion dynamics, further reducing memory overhead during autoregressive reasoning.
Our method delivers substantial reductions in computation and memory footprint, averaging a 40.37\% reduction in FLOPs and a 46.07\% reduction in KV-cache usage,
while consistently preserving or even improving reasoning performance across seven multimodal spatial reasoning benchmarks and further generalizing to broader multimodal reasoning tasks,
establishing a scalable and robust recipe for efficient multimodal reasoning.
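The abstract does not give implementation details, so the sketch below is only a rough illustration of the kind of machinery described: attention-derived token-importance scoring, modality-asymmetric top-k pruning, and a depth-dependent KV-cache budget. All names (`retention_scores`, `asymmetric_prune`, `progressive_kv_budget`), keep ratios, and the linear schedule are hypothetical assumptions for exposition, not DARE's actual algorithm.

```python
# Hypothetical illustration only: function names, keep ratios, and the scoring
# rule are assumptions for exposition, not DARE's actual design.
import torch

def retention_scores(attn, temperature=1.0):
    """Differentiable token-importance scores from one layer's attention maps.

    attn: (batch, heads, q_len, kv_len) attention weights.
    Returns (batch, kv_len) soft retention scores.
    """
    saliency = attn.mean(dim=1).mean(dim=1)           # average over heads and queries
    return torch.softmax(saliency / temperature, -1)  # soft, differentiable weighting

def asymmetric_prune(scores, is_visual, keep_visual=0.3, keep_text=0.7):
    """Keep different fractions of visual vs. text tokens (asymmetric compression).

    scores:    (batch, seq) retention scores
    is_visual: (seq,) boolean mask marking visual tokens
    Returns a (batch, seq) boolean keep mask.
    """
    keep_masks = []
    for b in range(scores.size(0)):
        keep = torch.zeros_like(is_visual)
        for modality_mask, ratio in ((is_visual, keep_visual), (~is_visual, keep_text)):
            idx = modality_mask.nonzero(as_tuple=True)[0]
            if idx.numel() == 0:
                continue
            k = max(1, int(ratio * idx.numel()))
            top = scores[b, idx].topk(k).indices
            keep[idx[top]] = True
        keep_masks.append(keep)
    return torch.stack(keep_masks)

def progressive_kv_budget(layer, num_layers, base_keep=0.9, final_keep=0.4):
    """Linearly tighten the KV-cache keep ratio with depth (a stand-in schedule)."""
    frac = layer / max(1, num_layers - 1)
    return base_keep + frac * (final_keep - base_keep)
```

In such a scheme, each decoding step would score cached tokens per layer, drop the lowest-scoring ones under the layer's budget, and apply the looser text ratio to the verbal stream once visual cues have been absorbed at depth; the actual retention mechanism, hop-awareness, and fusion-aligned policy are described in the paper itself.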
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 14462