ProtoOcc: Accurate, Efficient 3D Occupancy Prediction Using Dual Branch Encoder-Prototype Query Decoder
Abstract: In this paper, we introduce ProtoOcc, a novel 3D occupancy prediction model that predicts the occupancy states and semantic classes of 3D voxels through a deep semantic understanding of scenes. ProtoOcc consists of two main components: the Dual Branch Encoder (DBE) and the Prototype Query Decoder (PQD). The DBE produces a new 3D voxel representation by combining 3D voxel and BEV representations across multiple scales through a dual branch structure. This design combines the BEV representation, which offers a large receptive field, with the voxel representation, which preserves higher spatial resolution, thereby improving both performance and computational efficiency. The PQD employs two types of prototype-based queries to expedite the Transformer decoding process: Scene-Adaptive Prototypes are generated from the 3D voxel features of the input sample, while Scene-Agnostic Prototypes are updated during training as an Exponential Moving Average of the Scene-Adaptive Prototypes. Using these prototype-based queries for decoding, ProtoOcc predicts 3D occupancy directly in a single step, eliminating the need for iterative Transformer decoding. Additionally, we propose Robust Prototype Learning, which injects noise into the prototype generation process and trains the model to denoise the prototypes, enhancing the robustness of ProtoOcc against degraded prototype feature quality. ProtoOcc achieves state-of-the-art performance with 45.02% mIoU on the Occ3D-nuScenes benchmark. With single-frame input, it reaches 39.56% mIoU at 12.83 FPS on an NVIDIA RTX 3090.
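To make the prototype-query mechanism concrete, the following is a minimal PyTorch sketch of how the two prototype types and the noise injection described in the abstract might fit together. Only the EMA update of Scene-Agnostic Prototypes and the noise-then-denoise training signal follow directly from the abstract; the class-probability-weighted pooling used to form Scene-Adaptive Prototypes, the auxiliary `seg_logits` input, and all names and hyperparameters (`ema_momentum`, `noise_std`) are assumptions for illustration, not the paper's actual implementation.

```python
import torch


class PrototypeQueryGenerator(torch.nn.Module):
    """Sketch of prototype-based query generation (assumed details noted inline)."""

    def __init__(self, num_classes: int, dim: int,
                 ema_momentum: float = 0.99, noise_std: float = 0.1):
        super().__init__()
        self.ema_momentum = ema_momentum
        self.noise_std = noise_std
        # Scene-Agnostic Prototypes: persistent across samples, updated by EMA during training.
        self.register_buffer("agnostic_protos", torch.zeros(num_classes, dim))

    def forward(self, voxel_feats: torch.Tensor, seg_logits: torch.Tensor) -> torch.Tensor:
        # voxel_feats: (B, C, X, Y, Z) encoder output; seg_logits: (B, K, X, Y, Z)
        # coarse per-voxel class scores from a hypothetical auxiliary head.
        B = voxel_feats.shape[0]
        feats = voxel_feats.flatten(2)                  # (B, C, N)
        probs = seg_logits.flatten(2).softmax(dim=1)    # (B, K, N)

        # Scene-Adaptive Prototypes: class-probability-weighted pooling of voxel
        # features for the current sample (assumed pooling scheme).
        adaptive = torch.einsum("bkn,bcn->bkc", probs, feats)
        adaptive = adaptive / (probs.sum(dim=-1, keepdim=True) + 1e-6)  # (B, K, C)

        if self.training:
            # Scene-Agnostic Prototypes: EMA of the batch-averaged
            # Scene-Adaptive Prototypes, as described in the abstract.
            with torch.no_grad():
                batch_mean = adaptive.mean(dim=0)  # (K, C)
                self.agnostic_protos.mul_(self.ema_momentum).add_(
                    batch_mean, alpha=1.0 - self.ema_momentum)

            # Robust Prototype Learning: inject Gaussian noise into the adaptive
            # prototypes so the decoder learns to denoise degraded prototypes.
            adaptive = adaptive + torch.randn_like(adaptive) * self.noise_std

        # Queries for the single-step decoder: both prototype types, concatenated.
        queries = torch.cat(
            [adaptive, self.agnostic_protos.unsqueeze(0).expand(B, -1, -1)], dim=1)
        return queries  # (B, 2K, C)
```

In this reading, the decoder consumes the returned queries once and produces per-class occupancy logits in a single pass, which is what removes the iterative Transformer decoding loop; the exact decoder design is not specified by the abstract.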