SIRE: SE(3) Intrinsic Rigidity Embeddings

TMLR Paper5337 Authors

08 Jul 2025 (modified: 09 Nov 2025) · Decision pending for TMLR · CC BY 4.0
Abstract: Motion serves as a powerful cue for scene perception and understanding by separating independently moving surfaces and organizing the physical world into distinct entities. We introduce SIRE, a self-supervised method for motion discovery of objects and dynamic scene reconstruction from casual scenes by learning intrinsic rigidity embeddings from videos. Our method trains an image encoder to estimate scene rigidity and geometry, supervised by a simple 4D reconstruction loss: a least-squares solver uses the estimated geometry and rigidity to lift 2D point track trajectories into SE(3) tracks, which are simply re-projected back to 2D and compared against the original 2D trajectories for supervision. Crucially, our framework is fully end-to-end differentiable and can be optimized either on video datasets to learn generalizable image priors, or even on a single video to capture scene-specific structure -- highlighting strong data efficiency. We demonstrate the effectiveness of our rigidity embeddings and geometry across multiple settings, including downstream object segmentation, SE(3) rigid motion estimation, and self-supervised depth estimation. Our findings suggest that SIRE can pave the way towards self-supervised learning of priors over geometry and motion rigidity from large-scale video data.
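The supervision described in the abstract (lift 2D tracks with predicted depth, solve a least-squares SE(3) fit weighted by rigidity, re-project, and compare against the observed 2D trajectories) can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the intrinsics `F, CX, CY`, the weighted Kabsch solve, and all function names are assumptions standing in for the paper's differentiable solver and predicted rigidity embeddings.

```python
import numpy as np

F, CX, CY = 500.0, 320.0, 240.0  # assumed pinhole intrinsics (illustrative)

def lift(uv, depth):
    """Back-project 2D pixel tracks with predicted depth to camera-frame 3D."""
    x = (uv[:, 0] - CX) / F
    y = (uv[:, 1] - CY) / F
    return np.stack([x * depth, y * depth, depth], axis=1)

def project(X):
    """Pinhole projection of 3D points back to pixels."""
    return np.stack([F * X[:, 0] / X[:, 2] + CX,
                     F * X[:, 1] / X[:, 2] + CY], axis=1)

def fit_se3_weighted(P, Q, w):
    """Weighted least-squares rigid fit (Kabsch): argmin_{R,t} sum_i w_i ||R P_i + t - Q_i||^2."""
    w = w / w.sum()
    mp, mq = (w[:, None] * P).sum(0), (w[:, None] * Q).sum(0)
    H = (w[:, None] * (P - mp)).T @ (Q - mq)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))       # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, mq - R @ mp

def reprojection_loss(uv0, uv1, depth0, depth1, rigid_w):
    """Lift both frames, fit one rigidity-weighted SE(3), re-project, compare in 2D."""
    X0, X1 = lift(uv0, depth0), lift(uv1, depth1)
    R, t = fit_se3_weighted(X0, X1, rigid_w)
    uv1_hat = project(X0 @ R.T + t)              # SE(3)-transported points, re-projected
    return (rigid_w * ((uv1_hat - uv1) ** 2).sum(axis=1)).mean()
```

For a set of points that truly move rigidly with consistent depth, the fitted SE(3) transports frame-0 points exactly onto frame 1 and the loss vanishes; errors in predicted depth or rigidity weights surface as 2D re-projection residuals, which is what makes the loss a usable self-supervised training signal.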
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission:
• **Full forward pass notation and descriptions (Main Sec. 3.5).** We added a compact description of the full SIRE forward pass, covering per-frame prediction of depth/rigidity, the SE(3) track solve, the chaining of per-track SE(3) motions, and the SE(3)-induced dense-in-time re-projection loss, also clarifying the frame of reference.
• **Architecture illustration and description (Supp. Sec. 1).** We illustrated the network's predictions in the forward pass and detailed the pre-training or initialization of each component.
• **Ablations: robustness under bad tracks, depth supervision, and embedding dimensions (Supp. Sec. 3, 1.1).** We ran SIRE on a video sequence in which a duck floats over water, inducing noisy and lost point tracks (illustrated). We show that without depth supervision in this challenging case, and when trained on just this single video (no dataset-wide priors), our model can struggle to coherently group rigid bodies, and that adding depth supervision yields intuitively and coherently grouped rigid bodies. We also ablate the choice of rigidity-embedding dimension on the CO3D-Dogs dataset and show that this hyperparameter has a 'sweet spot' between over- and under-parameterized dimensions.
• **Additional multi-object datasets (Supp. Sec. 2).** We ran SIRE on two more multi-object datasets: one of robot-gripper demonstration videos and another of highway driving scenes. We visualize predicted rigidity embeddings and geometry, highlighting that our model can predict meaningful rigid-object segmentations, especially on the robot dataset, but can struggle to segment multiple moving bodies when they exhibit parallel motion (cars dataset).
• **Typos & notation cleanup.** We fixed observed typos and took a thorough pass over the text to improve clarity of notation.
• **Code release.** We will publicly release code upon acceptance to further benefit the vision community.
Assigned Action Editor: ~David_Fouhey2
Submission Number: 5337