Keywords: Occupancy prediction, scene understanding, embodied intelligence, visual geometry model
TL;DR: We present VGMOcc, a framework that leverages visual geometry models for indoor occupancy prediction, achieving significant performance gains on Occ-ScanNet and EmbodiedOcc-ScanNet.
Abstract: Accurate 3D scene understanding is essential for embodied intelligence, with occupancy prediction emerging as a key task for reasoning about both objects and free space. Existing approaches largely rely on depth priors (e.g., DepthAnything) but make only limited use of 3D cues, restricting performance and generalization. Recently, Visual Geometry Models (VGMs) such as VGGT have shown strong capability in providing rich 3D priors, yet their outputs are constrained to visible surfaces and fail to capture volumetric interiors.
We present VGMOcc, a framework that adapts VGM priors for monocular occupancy prediction. Our method extends surface points inward along camera rays to generate volumetric samples, which are represented as Gaussian primitives for probabilistic occupancy inference. To handle streaming input, we further design a training-free incremental update strategy that fuses per-frame Gaussians into a unified global representation.
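To make the ray-extension step concrete, the following is a minimal NumPy sketch of how surface points recovered by a VGM could be pushed inward along camera rays to form volumetric samples. The function name, the fixed depth offsets, and the pinhole back-projection are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def extend_surface_points(depth, K, offsets=(0.05, 0.15, 0.30)):
    """Extend VGM surface points inward along camera rays (sketch).

    depth:   (H, W) per-pixel depth from a visual geometry model (e.g., VGGT)
    K:       (3, 3) camera intrinsics
    offsets: hypothetical extra distances (meters) added along each ray,
             pushing samples beyond the visible surface into the volume
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    # Per-pixel ray directions in camera coordinates (pinhole model).
    rays = np.stack([(u - K[0, 2]) / K[0, 0],
                     (v - K[1, 2]) / K[1, 1],
                     np.ones_like(depth)], axis=-1)      # (H, W, 3)
    # One sample per offset: offset 0.0 keeps the surface point itself,
    # larger offsets turn the surface into a thin volumetric shell.
    samples = [rays * (depth + d)[..., None] for d in (0.0, *offsets)]
    return np.stack(samples, axis=0)  # (1 + len(offsets), H, W, 3)
```

In the framework described above, each such sample would then serve as the center of a Gaussian primitive for probabilistic occupancy inference.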
Experiments on Occ-ScanNet and EmbodiedOcc-ScanNet demonstrate significant gains: VGMOcc improves mIoU by 9.99 points in the monocular setting and by 11.79 points in the streaming setting over the prior state of the art. Under the same depth prior, it gains 6.73 mIoU points while running 2.65$\times$ faster. These results highlight that VGMOcc effectively leverages VGMs for occupancy prediction and generalizes seamlessly to alternative 3D priors.
Code will be released.
Primary Area: applications to robotics, autonomy, planning
Submission Number: 2308