Toward Spatial Intelligence: A Unified Self-supervised Framework for 3D Representation Learning from Unposed Multi-View Images
Keywords: Self-supervised learning, 3D Gaussian splatting, Feed-forward 3D reconstruction, Spatial intelligence
TL;DR: UniSplat is a self-supervised, feed-forward framework that jointly learns geometry, appearance, semantics, and camera calibration from unposed multi-view images.
Abstract: Robust 3D representation learning forms the perceptual foundation of spatial intelligence, enabling downstream tasks in scene understanding and embodied AI. However, learning such representations directly from unposed multi-view images remains challenging. Recent self-supervised methods attempt to unify geometry, appearance, and semantics in a feed-forward manner, but they often suffer from weak geometry induction, limited appearance detail, and inconsistencies between geometry and semantics.
We introduce $\textbf{\textit{UniSplat}}$, a self-supervised framework designed to address these limitations through three complementary components.
First, we propose a $\textit{dual-masking strategy}$ that strengthens geometry induction in the encoder. By masking both encoder and decoder tokens, and biasing the decoder masks toward geometry-rich regions, the model is forced to infer structural information from incomplete visual cues, yielding geometry-aware representations even from unposed inputs.
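To make the dual-masking idea concrete, here is a minimal sketch (our illustration, not the authors' code; the function name, mask ratios, and the geometry_score proxy are all assumptions) combining a random encoder mask with a decoder mask biased toward geometry-rich tokens.

import torch

def dual_mask(tokens, geometry_score, enc_mask_ratio=0.5, dec_mask_ratio=0.25):
    """tokens: (B, N, D) patch embeddings; geometry_score: (B, N) proxy for geometric richness."""
    B, N, D = tokens.shape
    # Encoder mask: keep a random subset of tokens, as in standard masked image modeling.
    n_keep = int(N * (1 - enc_mask_ratio))
    keep_idx = torch.rand(B, N, device=tokens.device).argsort(dim=1)[:, :n_keep]
    enc_tokens = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
    # Decoder mask: hide the highest-scoring (geometry-rich) tokens so the decoder
    # must reconstruct structure from incomplete visual cues.
    n_dec = int(N * dec_mask_ratio)
    dec_idx = geometry_score.argsort(dim=1, descending=True)[:, :n_dec]
    dec_mask = torch.zeros(B, N, dtype=torch.bool, device=tokens.device)
    dec_mask[torch.arange(B).unsqueeze(1), dec_idx] = True
    return enc_tokens, keep_idx, dec_mask

# Example: 2 views of 196 patches with 768-dim tokens and a random saliency proxy.
enc_tokens, keep_idx, dec_mask = dual_mask(torch.randn(2, 196, 768), torch.rand(2, 196))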
Second, we develop a $\textit{coarse-to-fine Gaussian splatting strategy}$ that improves appearance learning by progressively refining the radiance field, recovering fine appearance detail and producing high-fidelity representations. Finally, to enforce geometric–semantic consistency, we introduce a $\textit{pose-conditioned recalibration mechanism}$ that couples the outputs of multiple heads: predicted 3D point and semantic maps are reprojected into the image plane using the estimated camera parameters and aligned with the corresponding RGB and semantic predictions, ensuring cross-task consistency and resolving geometry–semantic mismatches.
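As an illustrative sketch in our own notation (not the paper's), the recalibration can be read as a reprojection-consistency constraint: with estimated intrinsics $\hat{K}$ and pose $(\hat{R}, \hat{t})$, a predicted 3D point $\hat{X}$ with color prediction $\hat{c}$ and semantic prediction $\hat{s}$ is projected and compared against the image-plane outputs,
$$\hat{u} = \pi\!\big(\hat{K}(\hat{R}\hat{X} + \hat{t})\big), \qquad \mathcal{L}_{\text{recal}} = \big\|I(\hat{u}) - \hat{c}\big\|_1 + \lambda\,\big\|S(\hat{u}) - \hat{s}\big\|_1,$$
where $\pi$ denotes perspective division and $I$, $S$ are the RGB and semantic predictions sampled at $\hat{u}$; the specific loss form and weight $\lambda$ are assumptions for illustration.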
Together, these components yield unified 3D representations that are robust to unposed, sparse-view inputs and generalize across diverse tasks, laying a perceptual foundation for spatial intelligence.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 15441