Moving Beyond Navigation with Active Neural SLAM
01 Dec 2021 | navigation, reinforcement learning, domain generalization
Introduction
The ability to effectively harness autonomous control in real-world 3D environments largely depends on learning realistic navigation techniques for embodied agents. Advances over classical robotics call for efficient methods to navigate, i.e., to explore unseen simulated environments while transferring the attained performance to the real-world domain. Any autonomous agent in the wild should be able to make decisions on the fly, having explored its surroundings and remaining cognizant of sudden changes, as happens in the real world.

Apart from understanding the 3D scene, intelligent agents must explore new environments effectively, which can prove pivotal to downstream tasks. Some of these include PointNav (PointGoal Navigation), wherein an agent has to reach a given target coordinate within a fixed time budget, and ObjectNav (ObjectGoal Navigation), which requires semantic knowledge of the scene, as it focuses on recognizing instances/classes in the environment. At a high level, exploration for autonomous navigation consists of three sub-tasks [1]:
- Mapping, with long-range memory for long-horizon tasks;
- State Estimation, including position, orientation, and velocity;
- Path Planning, devising a feasible path to the goal location.
Breakthroughs in classical robotics relied only on Mapping [2] and Path Planning [3], further inspiring learning-based methods [1, 4] that learn exploration policies using deep neural networks. A recent work [4] learns exploration policies for navigation in an end-to-end setting, focusing primarily on the architecture, reward-function design, and optimization procedure, in order to overcome classical methods that rely heavily on geometry from sensor data and are hence susceptible to drift. It also extends beyond prior learning-based techniques by focusing on task-agnostic autonomous exploration with sensor-driven reward signals and learned maps instead of goal-oriented navigation, making the approach realizable in a real-world setup. Although the proposed policy tackles complex-scene navigation through dynamic updates that determine the agent's immediate action, this end-to-end framework, which employs imitation learning, depends on millions of frames of experience. As a result, it has a high sample complexity, which explodes with the size of the scene (area, number of floors).
Motivation
The work from Chen et al. [4] proved highly beneficial for realistic exploration, motivated primarily by the following uses of end-to-end learning:
- from RGB input streams, to allow flexibility w.r.t. input modalities and remove the need for geometry data from custom sensors;
- to ensure robustness to state-estimation errors while determining the agent's pose;
- to exploit structural regularities in training scenes so as to adapt to unseen environments.
Active Neural SLAM (ANS) [1] banks on the observation that these stages of an effective navigation pipeline operate at different time scales, and breaks the approach down in a modular and hierarchical manner.
Active Neural SLAM (ANS) [1] consists of three main modules: a learned Neural SLAM module, a global policy, and a local policy, connected through an intermediate map representation and an analytical path planner. In this way, the learning-based techniques are put to their best use, while the analytical path planner improves performance and sample efficiency.
- Learned Neural SLAM Module: It consists of a Mapper and a Pose Estimator. The Mapper is trained with ground-truth supervision (cross-entropy loss) to predict a free-space egocentric (first-person point-of-view) spatial map of the explored 3D environment. It encodes the visual RGB stream and decodes the embedding to create the required map, estimating the probabilities of obstacles in the scene and of the area explored by the agent. The Pose Estimator determines the agent's pose in a geocentric frame (earth-centered point-of-view) with supervision (mean-squared-error loss), using past sensor estimates and the egocentric map prediction. A minimal sketch of these two losses appears after this list.
- Global Policy: Using the predicted agent pose and spatial map from the previous step, a policy is learned with PPO (Proximal Policy Optimization, a reinforcement learning method for learning an optimal policy) to output a long-term goal, with the ambitious intent of traversing larger geodesic distances to reach distant goals. The agent attempts to maximize the coverage area while exploring the 3D environment. The long-term goal is then broken down into a short-term goal using an analytical path planner, such as Dijkstra's algorithm. Along the shortest path to the goal, the short-term goal point is sampled 0.25 m away from the agent's current position (see the second sketch after this list).
- Local Policy: Trained via imitation learning, the local policy uses the input RGB data and binned relative headings (distance and angle) to the short-term goal to decide the most likely action that reaches the short-term destination.
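To make the Neural SLAM supervision concrete, here is a minimal PyTorch sketch of the two losses described above: a per-cell binary cross-entropy on the predicted egocentric map channels (obstacles and explored area) and a mean squared error on the predicted pose. All shapes and names are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

# Illustrative shapes (assumptions): B egocentric map predictions of size
# M x M with 2 channels -- obstacle probability and explored-area probability.
B, M = 8, 240
pred_map = torch.rand(B, 2, M, M, requires_grad=True)  # Mapper output (post-sigmoid)
gt_map = torch.randint(0, 2, (B, 2, M, M)).float()     # ground-truth map supervision
pred_pose = torch.randn(B, 3, requires_grad=True)      # Pose Estimator output (x, y, theta)
gt_pose = torch.randn(B, 3)                            # ground-truth pose from the simulator

# Mapper: cross-entropy supervision on the predicted free-space/explored map.
map_loss = F.binary_cross_entropy(pred_map, gt_map)

# Pose Estimator: mean-squared-error supervision on the predicted pose.
pose_loss = F.mse_loss(pred_pose, gt_pose)

# Weighting between the two terms is omitted for brevity.
(map_loss + pose_loss).backward()
```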
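The hand-off from the long-term to the short-term goal can be sketched in the same spirit. The version below runs breadth-first search on a uniform-cost occupancy grid (equivalent to Dijkstra's algorithm when every step costs the same) and samples the waypoint 0.25 m along the path; the 5 cm map resolution is an assumption for illustration.

```python
from collections import deque

def shortest_path(grid, start, goal):
    """Breadth-first search on a uniform-cost occupancy grid; grid[r][c] == 0
    means free space. Returns the list of cells from start to goal."""
    rows, cols = len(grid), len(grid[0])
    parent = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:  # walk parents back to the start
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols \
                    and grid[nr][nc] == 0 and (nr, nc) not in parent:
                parent[(nr, nc)] = cell
                queue.append((nr, nc))
    return []  # goal unreachable

def short_term_goal(grid, agent, long_term_goal, cell_size_m=0.05):
    """Sample the short-term goal 0.25 m along the shortest path, i.e.
    0.25 / cell_size_m cells ahead of the agent (resolution is assumed)."""
    path = shortest_path(grid, agent, long_term_goal)
    steps = int(round(0.25 / cell_size_m))
    return path[min(steps, len(path) - 1)] if path else agent
```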
Hence, the three-fold motivation of Chen et al. [4] is realized in a modular, sample-efficient manner that combines learning with classical methods. More specifically,
- learning in the Neural SLAM module helps leverage any input modality,
- the global policy intelligently learns from structural regularities in the environment, and
- the local policy uses additional visual inputs to ensure robustness to state estimation errors.
Previous works failed to focus on the visual and physical realism of both the simulation and the agent's motion, using discrete 90-degree rotations, etc. [5]. To make motion more realistic, ANS implements actuation and sensor noise models in the Habitat simulator: a LoCoBot (a ROS research rover) is used to collect navigational data, and Gaussian Mixture Models are fitted to this data and implemented in the simulator for all experiments.
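As a rough illustration of this pipeline, one might fit such a noise model with scikit-learn and sample from it at every simulator step. The GaussianMixture API is real; the logged data and the injection point are assumptions for the sketch.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical logs: deviations (dx, dy, dtheta) between the commanded motion
# and what the LoCoBot actually executed, gathered over repeated trials.
rng = np.random.default_rng(0)
observed_deviations = rng.normal(scale=[0.02, 0.02, 0.05], size=(500, 3))

# Fit a small mixture to the deviations (in practice, one model per action).
actuation_noise = GaussianMixture(n_components=3).fit(observed_deviations)

def noisy_step(intended_delta):
    """Perturb an intended (dx, dy, dtheta) with a sample from the fitted
    noise model, mimicking what the simulator applies at each step."""
    noise = actuation_noise.sample(1)[0][0]
    return intended_delta + noise

print(noisy_step(np.array([0.25, 0.0, 0.0])))  # a nominal 0.25 m forward step
```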
Impact
While the results obtained may not qualify as "ground-breaking research," breaking the exploration of complex 3D scenes into modular sub-problems enables:
- Doing away with RL's high sample complexity: the approach is sample-efficient, achieving a speedup of over 75x against the best RL baseline, with just 1 million data samples required to produce competitive performance.
- Domain Generalization: Active Neural SLAM (ANS) [1] is trained on the Gibson [6] domain and transferred to the Matterport3D (MP3D) [7] domain. Trained on the Gibson training set, the method generalizes well in terms of coverage area (in m²) and % coverage of the total scene area on both small and large scenes from the Gibson [6] validation set. Although performance drops on large MP3D [7] scenes (which are significantly larger), the method transfers well to smaller MP3D [7] scenes of comparable size, completely unseen by the model. This observation holds for both training objectives, Exploration and PointGoal Navigation. Finally, the proposed method is also tested on a LoCoBot in the real world: it effectively explores the scene once the camera intrinsics are modified to match the Habitat [8] simulator, generalizing to the real-world domain.
- Task Generalization: The exploration models can be retrained or fine-tuned on the PointGoal navigation task by fixing the long-term goal output of the Global Policy to the goal coordinate to be reached within the defined time budget (see the sketch below). Experiments revealed that ANS transfers well to the task, outperforming the best RL and imitation-learning baselines, even on more challenging goal locations (with larger geodesic distance) and in the MP3D [7] domain. Since the learned Neural SLAM module and the local policy are invariant to the target task, knowledge can be effortlessly distilled for low-level navigational control and obstacle-aware movement in new environments.
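To make that transfer concrete, the change amounts to substituting the PointGoal target for the learned long-term goal while reusing the planner and local policy untouched. A minimal, hypothetical sketch (all names are illustrative):

```python
def select_long_term_goal(global_policy, map_and_pose, task, pointgoal=None):
    """For exploration, the learned global policy proposes the long-term goal;
    for PointNav, the given target coordinate is used directly, and the
    downstream planner and local policy are reused without retraining."""
    if task == "pointnav" and pointgoal is not None:
        return pointgoal                    # fixed goal coordinate
    return global_policy(map_and_pose)      # learned exploration goal
```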
Future Extensions
Active Neural SLAM (ANS) [1] opens a door for a wide range of possibilities:
- Scene understanding using ANS [1]: Previous works such as Semantic Curiosity [9] and Semantic Exploration [10] focus on detecting surrounding object semantics and exploring unseen environments based on a target instance or class. However, the former [9], which pushes detection and segmentation on static datasets into an active-learning framework, assumes labeled trajectories and does not demonstrate real-world domain transfer, which is pivotal to autonomous navigational control. Furthermore, there is scope for improving the Semantic Exploration [10] method, which incurs errors in accurately localizing goal objects and discrepancies in segmentation predictions. Instead, incorporating uncertainty in mapping into the class-specific channels of the intermediate semantic map [11] could guide goal selection, as sketched below. An uncertainty-driven navigation policy exploiting semantic hallucination rather than target-specific objectives could remove bias arising from an information deficit in unseen environments.
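One way to operationalize this, as a rough sketch: treat each class channel of the semantic map as a per-cell probability, score cells by their summed Bernoulli entropy, and pick the most uncertain reachable cell as the next long-term goal. The shapes and the reachability mask are assumptions for illustration.

```python
import numpy as np

def most_uncertain_goal(semantic_map, explorable_mask, eps=1e-8):
    """semantic_map: (C, M, M) array of per-class occupancy probabilities;
    explorable_mask: (M, M) boolean mask of cells the agent can reach.
    Returns the (row, col) of the reachable cell whose class channels
    carry the highest total Bernoulli entropy (i.e., the most uncertain)."""
    p = np.clip(semantic_map, eps, 1.0 - eps)
    entropy = -(p * np.log(p) + (1 - p) * np.log(1 - p)).sum(axis=0)
    entropy[~explorable_mask] = -np.inf  # never pick unreachable cells
    return np.unravel_index(np.argmax(entropy), entropy.shape)
```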

- Moving beyond just navigation using ANS [1]: Embodied agents in simulated 3D environments are endowed with the flexibility to learn regularities in their surroundings that transfer across domains and tasks. Using ANS [1] for navigation to a goal coordinate, or Semantic Exploration [10] for navigation to a given class instance, could be interleaved with the task of motion prediction (or synthesis) on the fly. Recent work on human motion synthesis [12] generates realistic motion between two points, using scene and agent-pose information as inputs. The long-term or short-term goal, in conjunction with the map and pose from the learned Neural SLAM module, could be leveraged for this purpose. Navigation and motion synthesis for long-horizon tasks will impose heavy compute demands in unbounded scenes in the wild. Current approaches use the Conditional Variational Auto-Encoder (CVAE) framework, generating the motion of human agents in an autoregressive setting using Long Short-Term Memory (LSTM) with fully-connected layers [12], or a Transformer-based VAE [13] for action-conditioned motion; a simplified sketch of the autoregressive CVAE idea follows below. Structured State Spaces [14] could be studied to model longer sequences efficiently for long-range motion synthesis, handling unbounded motion sequences of varying length up to 60x faster, in linear time and memory, with substantially fewer parameters. In theory, this can aid applications in graphics, such as Virtual Reality (VR) and video games, in learning holistic agent-scene interaction: for example, a learned agent could navigate two or more floors in a complex environment to reach a specific object instance in a given room and perform a user-specified action such as "sit on the sofa in the living room of the ground floor," while synthesizing motion dynamically. Additional agents could also be incorporated for realistic gaming and VR experiences, using methods such as [15]. The figures below serve as a high-level overview of a plausible pipeline that could be built on ANS [1], solely for illustrative purposes. We also create a 3D scene with additionally generated agents [15] as a visual representation.
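For orientation, here is a heavily simplified sketch of the autoregressive CVAE idea: an LSTM cell with a fully-connected head predicts the next pose conditioned on a sampled latent code, a context vector (e.g., encoded scene, map, and goal from the navigation stack), and the previous pose. Dimensions and conditioning are illustrative assumptions, not the cited models [12, 13].

```python
import torch
import torch.nn as nn

class MotionDecoderStep(nn.Module):
    """One autoregressive step of a CVAE-style motion decoder: given a
    latent code z, a context vector, and the previous pose, predict the
    next pose."""
    def __init__(self, pose_dim=69, ctx_dim=128, z_dim=32, hidden=256):
        super().__init__()
        self.lstm = nn.LSTMCell(pose_dim + ctx_dim + z_dim, hidden)
        self.head = nn.Linear(hidden, pose_dim)  # fully-connected output

    def forward(self, prev_pose, ctx, z, state):
        h, c = self.lstm(torch.cat([prev_pose, ctx, z], dim=-1), state)
        return self.head(h), (h, c)

# Rolling out a short motion sequence from one sampled latent code:
step = MotionDecoderStep()
pose = torch.zeros(1, 69)                  # e.g. an SMPL-style pose vector
ctx, z = torch.zeros(1, 128), torch.randn(1, 32)
state = None                               # LSTMCell initializes zeros itself
poses = []
for _ in range(30):                        # 30 illustrative frames
    pose, state = step(pose, ctx, z, state)
    poses.append(pose)
```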


- ANS [1] is trained and tested on closed 3D scenes in Habitat. However, the generalization of this method, using Habitat, to open scenes, such as those from PROX [16] and selective scenes from Replica [17], remains in doubt. In such cases, the 3D scene could be manipulated, e.g., by closing the meshes through constructing a ceiling (a naive version is sketched below). Further study is required concerning this method's dependence on scene characteristics, moving beyond Gibson [6] and MP3D [7] scenes. Sample 3D scenes from these datasets are illustrated below, highlighting the difference between closed and open scenes.
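As a crude illustration of the kind of manipulation meant here, an open scan could be capped with a flat ceiling using a mesh library such as trimesh. The library calls are real, but the approach is a naive assumption (a z-up convention, no handling of doors or windows, no watertightness guarantees).

```python
import trimesh

# Hypothetical open-top scan; force="mesh" flattens a scene into one mesh.
scene = trimesh.load("open_scene.ply", force="mesh")
(min_x, min_y, _), (max_x, max_y, max_z) = scene.bounds

# Build a thin box spanning the scene footprint just above its highest point
# (assumes z is the up axis).
ceiling = trimesh.creation.box(extents=(max_x - min_x, max_y - min_y, 0.05))
ceiling.apply_translation(
    ((min_x + max_x) / 2, (min_y + max_y) / 2, max_z + 0.025))

closed = trimesh.util.concatenate([scene, ceiling])
closed.export("closed_scene.ply")
```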


References
[1] Chaplot, D.S., Gandhi, D., Gupta, S., Gupta, A. and Salakhutdinov, R., 2020. Learning To Explore Using Active Neural SLAM. In International Conference on Learning Representations (ICLR)
[2] Hartley, R. and Zisserman, A. (2004). Multiple View Geometry in Computer Vision. 2nd ed. Cambridge University Press.
[3] Kavraki, L.E., Švestka, P., Latombe, J.-C., & Overmars, M.H. (1996). Probabilistic Roadmaps for Path Planning in High-Dimensional Configuration Spaces. IEEE Transactions on Robotics and Automation, 12(4), 566-580.
[4] Chen, T., Gupta, S. and Gupta, A., 2019. Learning Exploration Policies for Navigation. In International Conference on Learning Representations (ICLR)
[5] Zhu, Y., Mottaghi, R., Kolve, E., Lim, J.J., Gupta, A.K., Fei-Fei, L., & Farhadi, A. (2017). Target-driven visual navigation in indoor scenes using deep reinforcement learning. 2017 IEEE International Conference on Robotics and Automation (ICRA), 3357-3364.
[6] F. Xia, A. R. Zamir, Z. He, A. Sax, J. Malik and S. Savarese, “Gibson Env: Real-World Perception for Embodied Agents,” 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 9068-9079, doi: 10.1109/CVPR.2018.00945.
[7] Chang, A.X., Dai, A., Funkhouser, T.A., Halber, M., Nießner, M., Savva, M., Song, S., Zeng, A., & Zhang, Y. (2017). Matterport3D: Learning from RGB-D Data in Indoor Environments. 2017 International Conference on 3D Vision (3DV), 667-676.
[8] Savva, M., Kadian, A., Maksymets, O., Zhao, Y., Wijmans, E., Jain, B., Straub, J., Liu, J., Koltun, V., Malik, J., Parikh, D., & Batra, D. (2019). Habitat: A Platform for Embodied AI Research. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 9338-9346.
[9] Chaplot, D.S., Jiang, H., Gupta, S., & Gupta, A.K. (2020). Semantic Curiosity for Active Visual Learning. ECCV.
[10] Chaplot, D.S., Gandhi, D., Gupta, A. and Salakhutdinov, R. 2020. Object Goal Navigation using Goal-Oriented Semantic Exploration. In Neural Information Processing Systems (NeurIPS-20).
[11] Georgakis, G., Bucher, B., Schmeckpeper, K., Singh, S., & Daniilidis, K. (2021). Learning to Map for Active Semantic Goal Navigation. ArXiv, abs/2106.15648.
[12] Wang, J., Xu, H., Xu, J., Liu, S., & Wang, X. (2021). Synthesizing Long-Term 3D Human Motion and Interaction in 3D Scenes. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 9396-9406.
[13] Petrovich, M., Black, M.J., and Varol, G. (2021). Action-Conditioned 3D Human Motion Synthesis with Transformer VAE. 2021 International Conference on Computer Vision (ICCV)
[14] Gu, A., Goel, K., & Ré, C. (2021). Efficiently Modeling Long Sequences with Structured State Spaces. ArXiv, abs/2111.00396.
[15] Zhang, Y., Hassan, M., Neumann, H., Black, M.J., & Tang, S. (2020). Generating 3D People in Scenes Without People. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 6193-6203.
[16] Hassan, M., Choutas, V., Tzionas, D., & Black, M.J. (2019). Resolving 3D Human Pose Ambiguities With 3D Scene Constraints. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2282-2292.
[17] Straub, J., Whelan, T., Ma, L., Chen, Y., Wijmans, E., Green, S., Engel, J., Mur-Artal, R., Ren, C.Y., Verma, S., Clarkson, A., Yan, M., Budge, B., Yan, Y., Pan, X., Yon, J., Zou, Y., Leon, K., Carter, N., Briales, J., Gillingham, T., Mueggler, E., Pesqueira, L., Savva, M., Batra, D., Strasdat, H.M., Nardi, R.D., Goesele, M., Lovegrove, S., & Newcombe, R.A. (2019). The Replica Dataset: A Digital Replica of Indoor Spaces. ArXiv, abs/1906.05797.