A Universal Self-Supervised Paradigm via 3D Gaussian Splatting

18 Sept 2025 (modified: 11 Feb 2026) | Submitted to ICLR 2026 | Readers: Everyone | License: CC BY 4.0
Keywords: Self-supervised learning, Gaussian splatting, 3D vision
TL;DR: We propose a 3D Gaussian Splatting-based universal self-supervised framework, which bridges 2D and 3D modalities and enables pre-training of both 2D and 3D encoders.
Abstract: Pre-training on large-scale unlabeled datasets has proven effective for improving model performance on downstream tasks, particularly when annotated data is scarce. However, because data structures differ inherently across modalities, most existing self-supervised approaches are tailored to either 2D or 3D networks, which limits their generalizability. In this paper, we propose GS$^3$, a 3D Gaussian Splatting (GS)-based universal self-supervised framework that bridges 2D and 3D modalities and enables pre-training of both 2D and 3D encoders. The core idea is to formulate neural rendering as a pretext task: visual features extracted from the input data are used to predict scene-level 3D Gaussians, which are then rendered into images via a fast tile-based rasterizer. The model is optimized by minimizing the difference between rendered and real images, and a masked modeling strategy further encourages robust, spatially aware representation learning. We evaluate GS$^3$ on five representative downstream tasks, including detection, segmentation, and reconstruction. Experimental results show that GS$^3$ consistently achieves performance on par with or surpassing state-of-the-art methods, while significantly reducing memory overhead compared to prior NeRF-based approaches.
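The rendering pretext task described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes isotropic 2D Gaussians with a scalar intensity, a plain per-pixel loop instead of a tile-based rasterizer, an L1 photometric loss, and random square-patch masking. The encoder that would predict the Gaussians from the masked input is omitted, so `predicted` below is a hand-written stand-in, and all function names (`mask_patches`, `render`, `l1_loss`) are hypothetical.

```python
import math
import random


def mask_patches(image, patch, ratio, rng):
    """Zero out a random subset of patch x patch tiles (assumed masking scheme)."""
    h, w = len(image), len(image[0])
    tiles = [(r, c) for r in range(0, h, patch) for c in range(0, w, patch)]
    hidden = set(rng.sample(tiles, int(ratio * len(tiles))))
    out = [row[:] for row in image]
    for r0, c0 in hidden:
        for r in range(r0, min(r0 + patch, h)):
            for c in range(c0, min(c0 + patch, w)):
                out[r][c] = 0.0
    return out, hidden


def render(gaussians, h, w):
    """Front-to-back alpha compositing of isotropic 2D Gaussians.

    Each Gaussian is a tuple (x, y, depth, sigma, opacity, intensity).
    """
    ordered = sorted(gaussians, key=lambda g: g[2])  # nearest first
    img = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            transmittance = 1.0  # fraction of light not yet absorbed
            value = 0.0
            for x, y, _, sigma, opacity, intensity in ordered:
                d2 = (c - x) ** 2 + (r - y) ** 2
                alpha = opacity * math.exp(-d2 / (2.0 * sigma ** 2))
                value += transmittance * alpha * intensity
                transmittance *= 1.0 - alpha
            img[r][c] = value
    return img


def l1_loss(pred, target):
    """Mean absolute difference between rendered and reference images."""
    h, w = len(target), len(target[0])
    total = sum(abs(pred[r][c] - target[r][c]) for r in range(h) for c in range(w))
    return total / (h * w)


rng = random.Random(0)
# Reference image, here synthesized from known Gaussians for illustration.
true_gaussians = [(4.0, 5.0, 1.0, 2.0, 0.8, 1.0), (10.0, 9.0, 2.0, 3.0, 0.6, 0.5)]
target = render(true_gaussians, 16, 16)

# The masked image is what an encoder would see; the full image is the target.
masked_input, hidden = mask_patches(target, patch=4, ratio=0.5, rng=rng)

# Stand-in for the Gaussians a network would predict from masked_input.
predicted = [(5.0, 5.0, 1.0, 2.0, 0.8, 1.0), (10.0, 9.0, 2.0, 3.0, 0.6, 0.5)]
loss = l1_loss(render(predicted, 16, 16), target)
print(f"masked tiles: {len(hidden)}, photometric L1 loss: {loss:.4f}")
```

In a real training loop the loss would be backpropagated through a differentiable rasterizer into the encoder. The sketch only shows how masking, rendering, and the photometric objective fit together.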
Supplementary Material: pdf
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 11640