Keywords: 4D reconstruction, feedforward model, diffusion model, 3D Gaussian Splatting
TL;DR: We introduce a pose-free, feedforward framework for 4D scene reconstruction from unposed images.
Abstract: Autonomous vehicles require diverse dynamic scenes for robust training and evaluation, yet existing dynamic scene reconstruction methods are often limited by slow per-scene optimization and reliance on explicit annotations or camera calibration. In this paper, we introduce a pose-free, feedforward framework for 4D scene reconstruction that jointly infers camera parameters, dynamic Gaussian representations, and 3D motion directly from sparse, unposed images. Unlike prior feedforward approaches, our model accommodates an arbitrary number of input views, enabling long-sequence modeling and improved generalization. Dynamic objects are disentangled via estimated motion and aggregated into unified 3D Gaussian Splatting (3DGS) representations, while a diffusion-based refinement module mitigates flow artifacts and enhances novel view synthesis under sparse inputs. Trained on the Waymo Dataset and evaluated on nuScenes and Argoverse2, our method achieves superior performance while generalizing effectively across datasets, benefiting from the pose-free design that reduces dataset-specific biases. Additionally, the framework supports instance-level scene editing and high-fidelity view synthesis, providing a scalable foundation for real-world autonomous driving simulation.
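To make the described interface concrete, below is a minimal, hypothetical sketch of what a pose-free, feed-forward 4D reconstruction model could look like: it maps an arbitrary number of unposed images to per-view camera parameters, per-pixel 3D Gaussian parameters, and 3D motion. This is not the authors' implementation; all module names, tensor layouts, and dimensions (e.g. `PoseFree4DReconstructor`, 14 Gaussian channels) are illustrative assumptions.

```python
# Minimal conceptual sketch (not the paper's code): a pose-free, feed-forward
# 4D reconstruction interface. All names and dimensions are assumptions.
from dataclasses import dataclass

import torch
import torch.nn as nn


@dataclass
class SceneOutput:
    camera_params: torch.Tensor  # (V, 12): assumed per-view rotation + translation
    gaussians: torch.Tensor      # (V, 14, H', W'): xyz, scale, rotation, opacity, color
    motion: torch.Tensor         # (V, 3, H', W'): 3D displacement per predicted Gaussian


class PoseFree4DReconstructor(nn.Module):
    """Hypothetical feed-forward model: unposed images -> cameras, dynamic 3DGS, motion."""

    def __init__(self, feat_dim: int = 256):
        super().__init__()
        # Stand-in image encoder; a real model would use a stronger backbone.
        self.backbone = nn.Conv2d(3, feat_dim, kernel_size=8, stride=8)
        self.camera_head = nn.Linear(feat_dim, 12)                     # per-view pose parameters
        self.gaussian_head = nn.Conv2d(feat_dim, 14, kernel_size=1)    # per-pixel Gaussian parameters
        self.motion_head = nn.Conv2d(feat_dim, 3, kernel_size=1)       # per-pixel 3D motion

    def forward(self, images: torch.Tensor) -> SceneOutput:
        # images: (V, 3, H, W) unposed frames; V is arbitrary (no fixed view count).
        feats = self.backbone(images)                    # (V, C, H/8, W/8)
        cams = self.camera_head(feats.mean(dim=(2, 3)))  # pooled features -> (V, 12)
        gaussians = self.gaussian_head(feats)            # (V, 14, H/8, W/8)
        motion = self.motion_head(feats)                 # (V, 3, H/8, W/8)
        return SceneOutput(cams, gaussians, motion)


if __name__ == "__main__":
    model = PoseFree4DReconstructor()
    out = model(torch.randn(6, 3, 256, 256))  # six unposed input views
    print(out.camera_params.shape, out.gaussians.shape, out.motion.shape)
```

In this sketch, the dynamic/static disentanglement and the diffusion-based refinement described in the abstract would operate on the predicted motion and Gaussian fields downstream; they are omitted here for brevity.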
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 5342