Sparse Annotation, Dense Supervision: Unleashing Self-Training Power for Occupancy Prediction With 2D Labels
Abstract: Serving as a fundamental task in robotic navigation and autonomous driving, occupancy prediction is gaining increasing attention for its fine-grained perception of the 3D environment. Most existing methods rely on dense 3D annotations, which are expensive, labor-intensive, and difficult to scale in real-world applications. Recent studies explore cheaper, easier-to-obtain sparse 2D labels as a more scalable alternative. Although these methods make some progress, they often underperform their fully supervised counterparts owing to the lack of supervision in unlabeled regions. To bridge this gap, we propose a self-training framework that generates supervision in unannotated areas. A key component of self-training is the teacher-student framework, in which the teacher generates pseudo labels to guide student learning. However, a naive teacher tends to produce predictions that are sparse, noisy, and closely resemble the student's output, making it ineffective for guiding the student. To ensure effective knowledge transfer, we propose three key strategies: (1) strengthening the supervision signal by integrating prior knowledge to guide the teacher network, (2) improving pseudo-label quality by filtering out uncertain predictions, and (3) densifying supervision by aggregating predictions across frames. Experiments on Occ3D-nuScenes and SemanticKITTI demonstrate that our method achieves state-of-the-art performance under the weakly supervised setting. In particular, it achieves 29.05 mIoU and 33.6 RayIoU on Occ3D-nuScenes, comparable to some fully supervised methods.
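The abstract does not give implementation details, but the teacher-student self-training loop it describes can be sketched generically. The snippet below is a minimal, hypothetical illustration (not the paper's actual code): the function names, the EMA momentum, the confidence threshold, and the ignore label `-1` are all assumptions. It shows (a) a teacher whose weights track an exponential moving average of the student's, a common way to keep pseudo labels stable, (b) strategy (2), filtering out uncertain voxel predictions, and (c) strategy (3), densifying supervision by merging pseudo labels from multiple aligned frames.

```python
import numpy as np

def ema_update(teacher_w, student_w, momentum=0.999):
    # Teacher weights follow an exponential moving average of the student's,
    # so the teacher changes slowly and its pseudo labels stay stable.
    return {k: momentum * teacher_w[k] + (1 - momentum) * student_w[k]
            for k in teacher_w}

def filter_pseudo_labels(probs, conf_thresh=0.9):
    # probs: (N, C) per-voxel class probabilities from the teacher.
    # Keep only confident predictions; uncertain voxels are marked -1
    # so the training loss can ignore them (strategy 2).
    conf = probs.max(axis=1)
    labels = probs.argmax(axis=1)
    labels[conf < conf_thresh] = -1
    return labels

def aggregate_frames(per_frame_labels):
    # Densify supervision (strategy 3): a voxel left unlabeled (-1) in the
    # current frame inherits the first confident label found in any of the
    # other (assumed spatially aligned) frames.
    merged = per_frame_labels[0].copy()
    for lab in per_frame_labels[1:]:
        unfilled = merged == -1
        merged[unfilled] = lab[unfilled]
    return merged
```

Strategy (1), injecting prior knowledge into the teacher, is model-specific and omitted here; in practice it would modify how `probs` is produced before filtering.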
External IDs: DOI: 10.1109/lra.2025.3632717