SSM-PixNav: State Space Models for Pixel-Guided Embodied Navigation

TMLR Paper7914 Authors

13 Mar 2026 (modified: 23 May 2026) | Decision pending for TMLR | CC BY 4.0
Abstract: When a robot navigates towards a specific object or image, how do we ensure that it focuses on the exact location we are referring to? Object goal navigation, image goal navigation, goal instance navigation and pixel navigation are popular approaches to this problem. In this work, we focus on pixel navigation, which addresses the problem more precisely by giving the agent additional pixel-level guidance. Prior work has largely relied on RGB input; as a result, the policy lacks explicit geometric awareness, which can be important when visually similar regions differ in navigability. Additionally, recent work uses transformer-based architectures to model temporal dependencies in observations, which increases computational cost. Another practical limitation is the absence of an open benchmark dataset for reproducing the baselines. We address these limitations along three directions. First, we introduce an RGBD-PixNav policy, a transformer-based architecture that incorporates depth directly into the policy. Second, to improve temporal modelling while maintaining computational efficiency, we employ Mamba, a recent State Space Model (SSM) architecture that enables lightweight sequential scanning of the observations. Building on this, we develop Causal SSM-based navigation policies and introduce a depth gating mechanism to regulate the contribution of depth features during policy learning. Third, to facilitate reproducible evaluation and future research, we curate the PixNav Trajectories dataset using HM3D scenes in Habitat-sim. Through extensive experiments, we establish an RGB-only baseline and extend it to a transformer-based RGBD model and SSM-based variants. Results show that the proposed Causal SSM-RGB PixNav and Causal SSM-RGBD PixNav with depth gate consistently outperform the other policy variants, improving the success rate by $\approx$0.4 while halving the model size to $\approx$27M parameters. The models also demonstrate robustness to observation noise and varying history length. Code and dataset will be publicly released.
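As a rough illustration of what a depth gate can look like, the sketch below computes a single sigmoid gate from the joint RGB and depth features and uses it to scale how much depth is added to the RGB features. This is only an assumed formulation for illustration: the abstract does not specify the gate's form, and the function name `depth_gate`, the scalar gate, and the additive fusion are all hypothetical choices rather than the authors' implementation.

```python
import math


def sigmoid(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def depth_gate(rgb_feat, depth_feat, weights, bias):
    """Hypothetical scalar depth gate (not the paper's exact mechanism).

    A gate g in (0, 1) is computed from the concatenated features and
    scales the depth contribution: fused = rgb + g * depth.
    """
    if len(rgb_feat) != len(depth_feat):
        raise ValueError("rgb_feat and depth_feat must have the same length")
    joint = list(rgb_feat) + list(depth_feat)
    if len(weights) != len(joint):
        raise ValueError("weights must match the concatenated feature length")
    # Linear projection of the joint features to a single gate logit.
    logit = sum(w * x for w, x in zip(weights, joint)) + bias
    g = sigmoid(logit)
    # g near 0 suppresses depth; g near 1 passes it through fully.
    fused = [r + g * d for r, d in zip(rgb_feat, depth_feat)]
    return fused, g
```

With a strongly negative bias the gate closes and the fused features reduce to the RGB features, which is the kind of selective suppression of depth that a gating formulation is meant to allow.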
Submission Type: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: We sincerely thank the reviewers for their constructive feedback and insightful suggestions. We have carefully revised the manuscript to address all concerns and improve clarity, rigor, and completeness. First, we have restructured the methodology section to clearly separate the problem formulation, background, and our proposed contributions. The revised version improves readability and makes the novelty of our approach more explicit. To address concerns regarding dataset quality, we now include an explicit evaluation of the SPF-based oracle used for dataset generation. Following prior work, we report oracle success rates across difficulty levels, demonstrating that the generated trajectories provide strong and reliable supervision signals. For the robustness analysis, we agree that evaluating only RGB perturbations was insufficient. We have extended our experiments to include noise in both the RGB and depth modalities, providing a more comprehensive assessment. The results show consistent degradation with increasing noise, highlighting the sensitivity of depth signals and motivating our design choices. Regarding depth integration, we acknowledge that channel concatenation is a standard approach. Our contribution lies not in proposing a new fusion mechanism, but in systematically evaluating depth in the pixel-navigation setting, where it has been largely unexplored. Furthermore, we extend this baseline with SSM-based architectures and depth gating, offering deeper insights into when and how depth is beneficial. Following the reviewer's comment, we trained and evaluated an LSTM variant; as expected, this ablation shows that the LSTM helps on short-horizon or easier navigation tasks. Our planned future work on sim-to-real transfer covers testing the policy's behaviour on a physical robot and addressing the related challenges; this is explained in the revised draft.
Separate subsections on the need for causality and on the effect of the depth gate have been added to the Appendix to help the reader understand these concepts. Concerns about captions and figures have been addressed. We have also clarified observations from our experiments: RGB-only models perform strongly in simpler settings, indicating that appearance cues are often sufficient. Depth can introduce noise or redundancy if not handled carefully, particularly in easier scenarios. Our gating-based formulation helps regulate this effect, demonstrating that selective use of depth is more effective than naïve fusion. Figure 7 now includes the effect of noise on the depth map, and the references have been adjusted.
Assigned Action Editor: ~Mengmi_Zhang1
Submission Number: 7914