Seeing Once is Enough? Online Geometry-Aware Token Pruning for 3D Question Answering

Published: 03 Mar 2026, Last Modified: 05 Mar 2026
Venue: ES-Reasoning @ ICLR 2026
Readers: Everyone
License: CC BY 4.0
Keywords: Spatial Understanding, 3D Question Answering
Abstract: Recent Multi-modal Large Language Models (MLLMs) have demonstrated remarkable performance on 2D question answering tasks. However, extending these models to 3D question answering remains challenging, as they typically require multiple views of the scene, which incurs substantial computational cost at inference. To mitigate this issue, existing solutions rely on strategic frame selection or token-merging algorithms that require preprocessing all frames of the scene in advance, i.e., they operate offline. In contrast, we propose the first online token-pruning method that can be integrated seamlessly with current MLLMs for 3D question answering tasks, without additional training and with lower memory usage. Our key insight is to project each input frame into a shared voxel space using depth information and camera pose, identify spatially overlapping regions across frames, and selectively prune redundant image tokens before they enter the language model. Our method enables efficient online processing while reducing token usage by up to 50%. We apply this approach to Qwen2.5-VL-7B and Qwen3-VL-8B, demonstrating improved performance on the ScanQA, SQA3D, and OpenEQA-HM3D benchmarks.
Submission Number: 41
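
The pruning idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `backproject_patches` and `OnlineVoxelPruner`, the patch size, the voxel size, the pinhole camera model, and the rule of keeping tokens that lack valid depth are all assumptions made for illustration. Each image patch is back-projected to a world-space point using depth and camera pose, quantized to a voxel index, and kept only if no earlier frame has already covered that voxel.

```python
import numpy as np


def backproject_patches(depth, K, cam_to_world, patch=14):
    """Back-project the centre pixel of each image patch to world space.

    depth: (H, W) metric depth map; K: (3, 3) pinhole intrinsics;
    cam_to_world: (4, 4) camera pose. All three are assumed inputs.
    Returns (N, 3) world points and an (N,) validity mask (depth > 0).
    """
    H, W = depth.shape
    vs = np.arange(patch // 2, H, patch)  # rows of patch centres
    us = np.arange(patch // 2, W, patch)  # columns of patch centres
    uu, vv = np.meshgrid(us, vs)
    z = depth[vv, uu]
    x = (uu - K[0, 2]) * z / K[0, 0]
    y = (vv - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=-1).reshape(-1, 4)
    pts_world = (cam_to_world @ pts_cam.T).T[:, :3]
    return pts_world, z.reshape(-1) > 0


class OnlineVoxelPruner:
    """Hypothetical helper: tracks voxels covered by earlier frames."""

    def __init__(self, voxel_size=0.2):
        self.voxel_size = voxel_size
        self.seen = set()  # voxel indices observed in previous frames

    def keep_mask(self, pts_world, valid):
        """Return a boolean mask of patch tokens to keep for this frame."""
        idx = np.floor(pts_world / self.voxel_size).astype(np.int64)
        keys = [tuple(k) for k in idx.tolist()]
        keep = np.ones(len(keys), dtype=bool)
        for i, (key, ok) in enumerate(zip(keys, valid)):
            # Tokens without valid depth are kept (an assumption).
            if ok and key in self.seen:
                keep[i] = False  # voxel already covered by an earlier frame
        # Update the map only after the whole frame has been checked, so
        # pruning compares against previous frames, not the current one.
        self.seen.update(k for k, ok in zip(keys, valid) if ok)
        return keep
```

In use, one pruner instance would persist across the frame stream: for each incoming frame, call `backproject_patches`, pass the result to `keep_mask`, and forward only the visual tokens where the mask is true. Because the voxel set is updated frame by frame, no frame needs to be seen in advance, which is the online property the abstract emphasizes.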