EmbodiedGS: Reconstruct Unified Embodied Representation from RGB Stream

Published: 10 Sept 2025 (modified: 14 Nov 2025), ICLR 2026 Conference Withdrawn Submission, CC BY 4.0
Keywords: 3D Reconstruction, 3D Instance Segmentation
Abstract: This paper addresses the challenge of incrementally reconstructing object-centric 3D representations from only a pose-free RGB video stream. Existing dense SLAM methods face a dual challenge: they rely on precise camera poses and RGB-D input for initialization, and they lack precise instance-level scene understanding. Moreover, the quality of their reconstruction and perception is fragile to systematic errors. To this end, we propose EmbodiedGS, a pipeline that jointly performs incremental 3D reconstruction and perception from an RGB stream to construct an Object-Centric 3D Gaussians (OCGS) representation that is both geometrically accurate and rich in instance-level information. Specifically, our approach leverages MASt3R-SLAM for Gaussian geometric initialization and introduces a Global-Associated Instance Memory (GAIM) to consistently track objects across views using multi-modal cues. We then construct the initial OCGS by lifting instance information to 3D Gaussians via optimizable binary embeddings. Finally, this representation is refined through a joint optimization process that leverages the synergy between reconstruction and perception to mutually correct inaccuracies, yielding a robust, high-fidelity OCGS. Extensive experiments are conducted on the TUM-RGBD and ScanNet datasets and on a real-world robotic platform, where EmbodiedGS demonstrates competitive performance even compared with RGB-D SLAM methods and offline 3D instance segmentation methods.
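The abstract's idea of lifting instance labels into per-Gaussian binary embeddings can be illustrated with a minimal numpy sketch. This is not the paper's implementation: the embedding dimension `D`, the fixed binary code table, and the Hamming-nearest decoding rule are all illustrative assumptions standing in for the optimized embeddings described above.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 8  # embedding dimension per Gaussian (assumption)
N_INSTANCES = 5

# Hypothetical code table: instance id -> fixed D-bit binary code.
codes = ((np.arange(N_INSTANCES)[:, None] >> np.arange(D)) & 1).astype(float)

# Simulate 100 Gaussians with tracked instance labels (stand-in for GAIM output)
# whose embeddings were optimized toward their instance code, plus noise.
gt = rng.integers(0, N_INSTANCES, size=100)
emb = codes[gt] + 0.1 * rng.normal(size=(100, D))

# Decode: binarize each embedding, then match to the nearest instance code
# by Hamming distance to recover a per-Gaussian instance id.
binarized = (emb > 0.5).astype(float)
pred = np.argmin(np.abs(binarized[:, None, :] - codes[None]).sum(-1), axis=1)
accuracy = (pred == gt).mean()
```

With small embedding noise, binarization recovers the correct instance id for essentially every Gaussian, which is the property that makes per-Gaussian binary embeddings a compact way to store instance identity.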
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 3673