STM4D: 4D Occupancy Forecasting with 2D and 3D Spatio-Temporal Modeling

ICLR 2026 Conference Submission12972 Authors

18 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0

Keywords: 4D Occupancy, Spatio-temporal Modeling, Semi-Supervised

Abstract: Vision-based 4D occupancy forecasting enables autonomous vehicles to predict future 3D semantic scenes from historical multi-view images, which is critical for driving safety. While current methods show promising results, the potential of jointly modeling 2D and 3D spatio-temporal dynamics and of leveraging temporal cues from 2D multi-view image sequences to improve 4D occupancy forecasting remains unexplored, which limits further performance gains. To address this gap, we introduce STM4D, a novel framework for 4D occupancy forecasting that jointly models temporal dynamics in both voxel-based representations and multi-view image sequences, while explicitly incorporating feature interaction between the two complementary branches. Our framework comprises three core components: 1) a 3D Spatio-Temporal (3DST) module that learns volumetric dynamics from historical voxel states to predict future voxel states; 2) a 2D Spatio-Temporal (2DST) module that employs an auxiliary segmentation forecasting task to enhance temporal semantic consistency; and 3) a Spatio-Temporal Interaction Modeling (STIM) module that enables camera-agnostic feature interaction between 2D and 3D representations. The unified architecture is trained end-to-end and establishes new state-of-the-art performance on both the Occ3D-nuScenes and Cam4DOcc benchmarks.

Primary Area: applications to computer vision, audio, language, and other modalities

Submission Number: 12972
