Keywords: 4D Reconstruction, Gaussian Splatting
Abstract: Generating 4D objects is challenging because it requires jointly maintaining appearance and motion consistency across space and time under sparse inputs, while avoiding artifacts and temporal drift.
We hypothesize that these inconsistencies stem from supervision that relies solely on pixel- or latent-space video-diffusion losses and lacks explicit, temporally aware tracking guidance at the feature level.
To address this issue, we introduce \emph{Track4Animate3D}, a two-stage framework that unifies a multi-view video diffusion model with a foundation point tracker and a hybrid 4D Gaussian Splatting (4D-GS) reconstructor.
The core idea is to explicitly inject motion priors from a foundation point tracker into the feature representation for both video generation and 4D-GS.
In Stage One, we impose dense, feature-level point correspondences within the diffusion generator, enforcing temporally consistent feature representations that suppress appearance drift and strengthen cross-view coherence.
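To make the Stage-One supervision concrete, the sketch below shows one plausible form of a tracking-guided feature consistency loss (an illustrative assumption, not the paper's exact formulation): intermediate diffusion features are sampled along the tracker's point trajectories and penalized for drifting away from their reference-frame features. All tensor names, shapes, and the choice of an L2 penalty are hypothetical.

```python
import torch
import torch.nn.functional as F

def track_consistency_loss(feats, tracks, visibility):
    """Hypothetical feature-level tracking consistency loss.

    feats:      (T, C, H, W) diffusion features for T frames
    tracks:     (T, N, 2) tracked point coordinates in [-1, 1] (grid_sample convention)
    visibility: (T, N) boolean visibility mask from the point tracker
    """
    # Sample per-frame features at tracked locations: (T, C, 1, N) -> (T, C, N)
    sampled = F.grid_sample(
        feats, tracks.unsqueeze(1), align_corners=True
    ).squeeze(2)
    # Reference features taken from the first frame of each track (detached as targets)
    ref = sampled[0:1].detach()                                   # (1, C, N)
    mask = (visibility & visibility[0:1]).unsqueeze(1).float()    # (T, 1, N)
    # Penalize deviation of a track's feature from its reference over time
    diff = (sampled - ref) * mask
    return diff.pow(2).sum() / mask.sum().clamp(min=1.0)
```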
In Stage Two, we reconstruct a dynamic 4D-GS representation using a hybrid motion representation that concatenates co-located diffusion features (carrying tracking priors from Stage One) with Hex-plane features, and augments appearance with 4D spherical harmonics for higher-fidelity dynamics and illumination modeling.
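The hybrid Stage-Two representation can be pictured as a small decoder over the concatenated features. The module below is a minimal sketch under assumed dimensions and outputs (per-Gaussian position/rotation offsets plus a time-dependent spherical-harmonics residual); it is not the paper's implementation.

```python
import torch
import torch.nn as nn

class HybridDeformationHead(nn.Module):
    """Hypothetical decoder for the hybrid motion representation:
    Hex-plane features concatenated with co-located diffusion features."""

    def __init__(self, hex_dim=64, diff_dim=128, sh_coeffs=16, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hex_dim + diff_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.delta_xyz = nn.Linear(hidden, 3)              # position offset
        self.delta_rot = nn.Linear(hidden, 4)              # rotation offset (quaternion)
        self.delta_sh = nn.Linear(hidden, sh_coeffs * 3)   # 4D SH residual (RGB)

    def forward(self, hex_feat, diff_feat):
        # hex_feat: (G, hex_dim), diff_feat: (G, diff_dim) for G Gaussians
        h = self.mlp(torch.cat([hex_feat, diff_feat], dim=-1))
        return self.delta_xyz(h), self.delta_rot(h), self.delta_sh(h)
```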
\emph{Track4Animate3D} outperforms strong baselines (e.g., Animate3D, DG4D) on VBench metrics for multi-view video generation and on CLIP-O/F/C metrics and user preference studies for 4D generation, producing temporally stable, text-editable 4D assets.
Finally, we curate \emph{Sketchfab28}, a new high-quality 4D dataset for evaluating object-centric 4D generation in future research.
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 10559