Keywords: 360 camera, monocular depth estimation
Abstract: 360-degree monocular depth estimation plays a crucial role in scene understanding owing to its 180-degree by 360-degree field-of-view (FoV). To mitigate the distortions introduced by equirectangular projection, existing methods typically divide 360-degree images into distortion-free perspective patches. However, since these patches are processed independently, depth inconsistencies are often introduced by scale drift among patches. Recently, video depth estimation (VDE) models have leveraged temporal consistency to produce stable depth predictions across frames. Inspired by this, we propose to represent a 360-degree image as a sequence of perspective frames, mimicking the viewpoint adjustments users make when exploring a 360-degree scene in virtual reality. The spatial consistency among perspective depth patches can thus be enhanced by exploiting the temporal consistency inherent in VDE models. To this end, we introduce a training-free pipeline for 360-degree monocular depth estimation, called ST²360D. Specifically, ST²360D transforms a 360-degree image into perspective video frames, predicts video depth maps using VDE models, and seamlessly merges these predictions into a complete 360-degree depth map. To generate sequenced perspective frames that align with VDE models, we propose two tailored strategies. First, a spherical-uniform sampling (SUS) strategy facilitates uniform sampling of perspective views across the sphere, avoiding oversampling in polar regions, which typically contain limited structural detail. Second, a latitude-guided scanning (LGS) strategy organizes the frames into a coherent sequence, starting from the equator, prioritizing low-latitude slices, and progressively moving toward higher latitudes. Extensive experiments demonstrate that ST²360D achieves strong zero-shot capability on several datasets, supporting resolutions up to 4K.
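The two frame-generation strategies in the abstract can be illustrated with a minimal sketch. Here, spherical-uniform sampling is approximated by a Fibonacci lattice (a common way to avoid the polar oversampling of a regular latitude/longitude grid; the paper's exact SUS scheme may differ), and the latitude-guided scan is approximated by sorting views from the equator outward. The function names and the frame count are illustrative, not the authors' implementation.

```python
import math

def spherical_uniform_sample(n):
    """Sample n view directions roughly uniformly over the sphere using a
    Fibonacci lattice. Returns (latitude, longitude) pairs in degrees.
    This stands in for the paper's SUS strategy, which is not specified
    in the abstract."""
    golden = (1 + 5 ** 0.5) / 2  # golden ratio
    views = []
    for i in range(n):
        # Uniform in sin(latitude) -> uniform area on the sphere.
        lat = math.degrees(math.asin(2 * (i + 0.5) / n - 1))
        # Golden-angle increments spread longitudes evenly.
        lon = math.degrees((2 * math.pi * i / golden) % (2 * math.pi)) - 180
        views.append((lat, lon))
    return views

def latitude_guided_order(views):
    """Order views equator-first, then progressively higher |latitude|,
    mimicking the LGS scan described in the abstract."""
    return sorted(views, key=lambda v: abs(v[0]))

# Build an ordered "video" of 64 perspective viewpoints.
frames = latitude_guided_order(spherical_uniform_sample(64))
```

The ordered viewpoint list would then drive perspective-patch extraction and be fed to a VDE model as a frame sequence, with the per-frame depth maps merged back onto the sphere.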
Supplementary Material: zip
Primary Area: Applications (e.g., vision, language, speech and audio, Creative AI)
Submission Number: 6010