Keywords: Spatial Understanding, 3D Question Answering
Abstract: Recent Multi-modal Large Language Models (MLLMs) have demonstrated remarkable performance on 2D question answering tasks. However, extending these models to 3D question answering remains challenging, as they typically require multiple views of the scene, which incurs substantial computational cost at inference. To mitigate this issue, existing solutions rely on strategic frame selection or token-merging algorithms that require preprocessing all frames of the scene in advance, i.e., operating in an offline fashion. In contrast, we propose the first online token-pruning method that can be integrated seamlessly with current MLLM models for 3D question answering tasks, without additional training and with lower memory usage. Our key insight is to project each input frame into a shared voxel space using depth information and camera pose, identifying spatially overlapped regions across frames and selectively pruning redundant image tokens before they enter the language model. Our method enables efficient online processing while reducing token usage by up to 50%. We apply this approach to Qwen2.5-VL-7B and Qwen3-VL-8B, demonstrating improved performance on the ScanQA, SQA3D, and OpenEQA-HM3D benchmarks.
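The core idea described in the abstract — back-projecting each frame into a shared voxel grid and pruning tokens that land in already-observed voxels — can be illustrated with a minimal sketch. This is a hypothetical simplification (one token per pixel, a plain Python set as the voxel map, an invented `prune_overlapping_tokens` helper), not the paper's actual implementation:

```python
import numpy as np

def prune_overlapping_tokens(frames, voxel_size=0.1):
    """Mark tokens whose back-projected voxels were seen in earlier frames.

    Hypothetical sketch of online voxel-based pruning, assuming each
    frame is a dict with:
      'depth'      : (H, W) depth map in metres
      'intrinsics' : (3, 3) camera matrix K
      'pose'       : (4, 4) camera-to-world transform
    Returns one boolean (H, W) keep-mask per frame (here, one token
    per pixel for simplicity).
    """
    seen = set()          # voxel indices observed in earlier frames
    keep_masks = []
    for f in frames:
        depth, K, pose = f['depth'], f['intrinsics'], f['pose']
        H, W = depth.shape
        # Back-project every pixel into camera coordinates ...
        u, v = np.meshgrid(np.arange(W), np.arange(H))
        x = (u - K[0, 2]) / K[0, 0] * depth
        y = (v - K[1, 2]) / K[1, 1] * depth
        pts_cam = np.stack([x, y, depth, np.ones_like(depth)], axis=-1)
        # ... then into the shared world frame via the camera pose.
        pts_world = pts_cam.reshape(-1, 4) @ pose.T
        # Quantise world coordinates to integer voxel indices.
        vox = np.floor(pts_world[:, :3] / voxel_size).astype(np.int64)
        keep = np.empty(H * W, dtype=bool)
        for i, v_idx in enumerate(map(tuple, vox)):
            keep[i] = v_idx not in seen   # prune if voxel already seen
            seen.add(v_idx)
        keep_masks.append(keep.reshape(H, W))
    return keep_masks
```

Because `seen` is updated frame by frame, the masks can be computed online as frames arrive, without revisiting earlier frames; a second view of an already-covered region would have most of its tokens pruned before reaching the language model.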
Submission Number: 41