SNI-SLAM++: Tightly-Coupled Semantic Neural Implicit SLAM

Siting Zhu, Guangming Wang, Hermann Blum, Zhong Wang, Ganlin Zhang, Daniel Cremers, Marc Pollefeys, Hesheng Wang

Published: 2026, Last Modified: 26 Feb 2026. IEEE Trans. Pattern Anal. Mach. Intell. 2026. License: CC BY-SA 4.0
Abstract: We propose SNI-SLAM++, a tightly-coupled semantic SLAM system built on a neural implicit representation that simultaneously performs accurate semantic mapping, high-quality surface reconstruction, and robust camera tracking. Our system tightly integrates visual appearance, geometry, and semantics through five key components: (i) We introduce a hierarchical semantic representation that enables multi-level semantic comprehension for top-down structured semantic mapping of the scene. (ii) To fully exploit the correlation between multiple attributes of the environment, we integrate appearance, geometry, and semantic features through cross-attention for feature collaboration. This strategy yields a more multifaceted understanding of the environment, allowing SNI-SLAM++ to remain robust even when a single attribute is defective. (iii) We design an internal fusion-based decoder that obtains semantic, RGB, and Truncated Signed Distance Field (TSDF) values from multi-level features for accurate decoding. (iv) We introduce a semantics-coupled tracking framework that tightly incorporates semantic constraints into camera pose estimation in neural implicit SLAM. This framework leverages the multi-view consistency of semantics to construct a pose graph and perform semantic loop closure optimization, enabling robust tracking. (v) We propose a feature loss that updates the scene representation at the feature level. Compared with low-level losses such as RGB loss and depth loss, our feature loss guides network optimization at a higher level. SNI-SLAM++ demonstrates superior performance over all recent visual SLAM methods in mapping and tracking accuracy on the Replica, ScanNet, TUM-RGBD, and ScanNet++ datasets, while also delivering accurate semantic segmentation and 3D semantic mapping.
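As an illustration of component (ii), the following is a minimal sketch of generic scaled dot-product cross-attention, in which semantic features query a context made of geometry and appearance features. It is not the authors' architecture: the abstract does not specify projections, heads, or which attribute acts as the query, so the function names, the toy feature values, and the choice of semantics as the query are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Generic scaled dot-product cross-attention on lists of vectors.

    Each query vector attends over all key vectors; the output is the
    attention-weighted sum of the value vectors.
    """
    dim = len(keys[0])
    fused, all_weights = [], []
    for q in queries:
        # Similarity between this query and every key, scaled by sqrt(dim).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        weights = softmax(scores)
        # Weighted sum of the value vectors, one output channel at a time.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        fused.append(out)
        all_weights.append(weights)
    return fused, all_weights


# Hypothetical toy features for two sampled points (assumed values).
semantic = [[1.0, 0.0], [0.0, 1.0]]
geometry = [[1.0, 0.0], [0.0, 1.0]]
appearance = [[0.5, 0.5], [0.2, 0.8]]

# Assumption: semantics query a context built from geometry and appearance.
context = geometry + appearance
fused, weights = cross_attention(semantic, context, context)
```

In this sketch each fused vector is a convex combination of the geometry and appearance features, so a query still receives a usable output when one context attribute carries little signal. That is one plausible reading of the robustness claim, not a statement of how SNI-SLAM++ implements it.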