PASTEL: Panoramic Alignment for Monocular 4D Reconstruction

Yuankun Yang; Wei Yi; Bo Bai; Wenyang Zhou; Li Zhang

08 Sept 2025 (modified: 11 Feb 2026). Submitted to ICLR 2026. License: CC BY 4.0

Keywords: 4D reconstruction, video generation

Abstract: Reconstructing 4D scenes from casually captured monocular video is vital for applications in virtual reality (VR) and embodied AI. Recent advances in 4D reconstruction and novel view synthesis have significantly advanced this capability. However, all existing methods depend on pixel-level supervision and cannot recover regions beyond visible camera limits. We therefore introduce a new paradigm that reconstructs both the "visible regions" from monocular input and the "invisible regions" beyond observable camera boundaries. The most intuitive solution is to leverage video generation models as "generative priors". However, their inherent stochasticity and large solution space prevent stable and consistent view synthesis. As a result, naively incorporating generated content into the reconstruction process often causes artifacts, especially when camera trajectories deviate significantly from those of the original video. To overcome these challenges, we present Panoramic Alignment for StraTegic ExpLoitation of Generative Priors (PASTEL), which mitigates the inconsistencies introduced by generative priors. Specifically, PASTEL first aligns the scene into a spherical panoramic representation. Within this space, it identifies trajectories that minimize deviation while maximizing exploration beyond observable boundaries. These trajectories allow generative priors to explore beyond visible camera limits stably and consistently. PASTEL further introduces a comprehensive view expansion scheme with strategic 4D scene supervision to alleviate inconsistencies from generative priors. Experimental results show that PASTEL not only extrapolates plausible scene content beyond the observable boundaries of input monocular videos, but also substantially improves monocular 4D reconstruction performance, outperforming the previous state-of-the-art method by 1.2 dB in PSNR on the DyCheck iPhone dataset.

Supplementary Material: zip

Primary Area: applications to computer vision, audio, language, and other modalities

Submission Number: 3046
