LVSPM: Long Sequence View Synthesis and Pose Estimation Model

12 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: novel view synthesis, pose estimation, test-time training
TL;DR: Our feedforward model can perform novel view synthesis and pose estimation on long image sequences.
Abstract: We present LVSPM, a generalizable model that jointly estimates camera poses and synthesizes novel views from uncalibrated image collections. Unlike prior approaches that rely on dense geometric supervision, LVSPM is trained only with RGB images and pose supervision, avoiding the need for dense 3D ground truth. LVSPM employs test-time training (TTT) layers, which compress tokens into fixed-size hidden states and scale seamlessly to hundreds of input views. In experiments on RealEstate10k, Co3Dv2, and DL3DV, LVSPM surpasses VGGT in pose estimation across 10–256 input views. For novel view synthesis, LVSPM achieves state-of-the-art results in pose-free long-sequence rendering on the large-baseline dataset DL3DV, and even surpasses pose-dependent models.
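The abstract's key scaling claim rests on the TTT layer: rather than attending over all tokens, the layer maintains a fixed-size hidden state (the weights of a small inner model) that is updated by gradient descent as each token arrives, so memory stays constant no matter how many views are processed. The paper does not give its exact formulation here, so the following is a minimal sketch of the general TTT idea using a linear inner model with a reconstruction loss; the layer dimension `d`, the learning rate, and the self-supervised objective are illustrative assumptions, not LVSPM's actual design.

```python
import numpy as np

def ttt_layer(tokens, d, lr=0.1):
    """Sketch of a test-time-training layer: the hidden state is the
    weight matrix W of a small linear model, updated online per token."""
    # Fixed-size hidden state: (d x d) regardless of sequence length.
    # This constant memory footprint is what allows scaling to
    # hundreds of input views.
    W = np.zeros((d, d))
    outputs = []
    for x in tokens:
        # Inner self-supervised loss (illustrative): ||W x - x||^2.
        # Gradient w.r.t. W is 2 * (W x - x) x^T (an outer product).
        grad = 2.0 * np.outer(W @ x - x, x)
        W = W - lr * grad          # one gradient step per token
        outputs.append(W @ x)      # layer output for this token
    return np.array(outputs), W
```

The contrast with attention is that attention's state (the KV cache) grows linearly with the number of tokens, while here the state is compressed into `W`, trading exact recall for constant memory.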
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 4578