CSV-Occ: Fusing Multi-frame Alignment for Occupancy Prediction with Temporal Cross State Space Model and Central Voting Mechanism

Ziming Zhu; Yu Zhu; Jiahao Chen; LingXiaofeng; Huanlei Chen; Lihua Sun

Published: 01 May 2025, Last Modified: 23 Jul 2025, Venue: ICML 2025 poster, License: CC BY 4.0

Abstract: Image-based 3D semantic occupancy prediction has recently become an active topic in 3D scene understanding for autonomous driving. Unlike the bounding boxes used in 3D object detection, a voxel occupancy representation can describe the fine-grained contours of arbitrary obstacles in the scene, which benefits downstream autonomous driving tasks. In this work, we propose CSV-Occ to address two challenges. (1) Existing methods fuse temporal information with attention mechanisms, which incur high computational complexity. We extend the state space model to support interaction among multiple input sequences and perform temporal modeling in a cascaded architecture, thereby reducing the computational complexity from quadratic to linear. (2) Existing methods suffer from semantic ambiguity, so the centers of foreground objects are often predicted as empty voxels. We enable the model to explicitly vote for the instance center to which each voxel belongs and to spontaneously learn to use the features of other voxels in the same instance to update the semantics of the vacancies inside objects from coarse to fine. Experiments on the Occ3D-nuScenes dataset show that our method achieves state-of-the-art performance in camera-based 3D semantic occupancy prediction and also performs well on lidar point cloud semantic segmentation on the nuScenes dataset. We therefore believe that CSV-Occ is beneficial to the autonomous driving community and industry.
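To illustrate why a state space recurrence scales linearly with sequence length, the following is a minimal scalar sketch, not the CSV-Occ implementation. The function names, the scalar parameters `a`, `b`, `c`, and in particular the two-input variant (where one sequence drives the state and the other modulates the readout) are assumptions made for illustration; the abstract does not specify how the multi-input interaction is formulated.

```python
def ssm_scan(xs, a=0.9, b=1.0, c=1.0):
    """Scalar linear state space recurrence (illustrative only).

    h_t = a * h_{t-1} + b * x_t
    y_t = c * h_t

    The sequence is visited once, so the cost grows linearly with its
    length, unlike pairwise attention, which compares every pair of steps.
    """
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys


def cross_ssm_scan(current, history, a=0.9):
    """Hypothetical two-input variant (an assumption, not the paper's design).

    The history sequence updates the hidden state, and the current
    sequence scales the readout at each step, so the two inputs interact
    while the scan remains a single linear pass.
    """
    h = 0.0
    ys = []
    for x_cur, x_hist in zip(current, history):
        h = a * h + x_hist      # state is driven by the history frame
        ys.append(x_cur * h)    # readout is modulated by the current frame
    return ys


if __name__ == "__main__":
    print(ssm_scan([1.0, 0.0, 0.0], a=0.5))
    print(cross_ssm_scan([1.0, 1.0, 2.0], [1.0, 0.0, 0.0], a=0.5))
```

In both functions the loop body does a constant amount of work per step, which is the property behind the quadratic-to-linear claim; real state space models use vector states and learned, input-dependent parameters.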

Lay Summary: Predicting 3D semantic occupancy from images has recently become popular in 3D scene understanding for self-driving. Voxel occupancy describes fine-grained obstacle contours better than the bounding boxes of 3D object detection, which helps downstream self-driving tasks. Our method, CSV-Occ, addresses two challenges. First, we simplify the fusion of temporal information by extending the state space model, which cuts computational complexity. Second, we let the model vote for the instance center each voxel belongs to, which reduces semantic ambiguity. Tests on Occ3D-nuScenes and nuScenes lidar data show that our method performs strongly in camera-based 3D occupancy prediction and lidar semantic segmentation. We think CSV-Occ benefits the self-driving community and industry.
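The center-voting idea can be pictured with a small hand-written sketch. This is a rule-based toy under stated assumptions, not the learned mechanism in CSV-Occ: the function names, the per-voxel offset inputs, the label value `0` for "empty", and the majority-vote fill rule are all hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict


def vote_for_centers(voxels, offsets):
    """Group voxels by the grid cell each one votes for as its center.

    voxels:  list of integer (x, y, z) voxel coordinates
    offsets: list of predicted (dx, dy, dz) offsets, one per voxel
    Returns a dict mapping each voted center to the voxels that voted for it.
    """
    groups = defaultdict(list)
    for (x, y, z), (dx, dy, dz) in zip(voxels, offsets):
        center = (round(x + dx), round(y + dy), round(z + dz))
        groups[center].append((x, y, z))
    return dict(groups)


def fill_center_labels(groups, labels, empty=0):
    """Assign an empty voted center the majority label of its voters.

    labels maps voxel coordinates to a semantic class id; coordinates
    missing from the dict are treated as empty.
    """
    updated = dict(labels)
    for center, members in groups.items():
        if updated.get(center, empty) != empty:
            continue  # center already has a semantic label
        counts = Counter(
            labels[m] for m in members if labels.get(m, empty) != empty
        )
        if counts:
            updated[center] = counts.most_common(1)[0][0]
    return updated


if __name__ == "__main__":
    voxels = [(0, 0, 0), (2, 0, 0), (1, 1, 0)]
    offsets = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0), (0.1, -0.9, 0.0)]
    labels = {(0, 0, 0): 3, (2, 0, 0): 3, (1, 1, 0): 3}
    groups = vote_for_centers(voxels, offsets)
    print(fill_center_labels(groups, labels))
```

Here three surface voxels of one object all vote for the same interior cell, which was previously empty, and that cell inherits their class. The paper describes a learned, coarse-to-fine feature update rather than this direct label copy.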

Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.

Primary Area: Applications->Computer Vision

Keywords: 3D Semantic Occupancy Prediction, Semantic Segmentation, Autonomous Driving, Temporal Information, State Space Model, Voting Mechanism

Submission Number: 1568
