4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation

Published: 09 May 2026, Last Modified: 09 May 2026
License: CC BY 4.0
Keywords: 4D understanding, MLLM, VLM, Region-level understanding
TL;DR: We propose: 1) 4D-RGPT, a specialized MLLM designed to capture 4D representations from video inputs with enhanced temporal perception, and 2) R4D-Bench, a new benchmark for depth-aware dynamic scenes with region-level prompting.
Abstract: Despite advances in Multimodal LLMs (MLLMs), their ability to reason over 3D structures and temporal dynamics remains limited, constrained by weak 4D perception and temporal understanding. Existing 3D and 4D Video Question Answering (VQA) benchmarks also emphasize static scenes and lack region-level prompting. We tackle these issues by introducing: (a) 4D-RGPT, a specialized MLLM designed to capture 4D representations from video inputs with enhanced temporal perception; (b) Perceptual 4D Distillation (P4D), a training framework that transfers 4D representations from a frozen expert model into 4D-RGPT for comprehensive 4D perception; and (c) R4D-Bench, a benchmark for depth-aware dynamic scenes with region-level prompting, built via a hybrid automated and human-verified pipeline. Our 4D-RGPT achieves notable improvements on both existing 4D VQA benchmarks and the proposed R4D-Bench. Project page: https://www.ca-joe-yang.com/resource/projects/4D_RGPT/.
Supplementary Material: pdf
Previously Accepted: Yes
Previous Venue: CVPR 2026
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 12