Keywords: Vision-Based Navigation, Collision Avoidance, Catastrophic Forgetting, Depth Estimation
TL;DR: A fine-tuning method for visual navigation that efficiently adapts to novel environments and learns geometric cues while preserving the diverse action-distribution priors of a pre-trained navigation foundation model
Abstract: Navigation Foundation Models (NFMs) trained on large, cross-embodiment datasets have demonstrated strong generalization across diverse scenarios. In-domain fine-tuning of an NFM efficiently calibrates the visuomotor policy, promising further improvement even in a novel scenario. However, fine-tuned models still suffer from poor obstacle avoidance or fail to reach the provided goals. Furthermore, updating the model on a small subset of data typically erodes the pretrained prior, compromising the generalization gained from pretraining. Consequently, fine-tuning can actually degrade the model's capability for robust and accurate navigation. In this work, we present a novel fine-tuning method that leverages large-scale pretraining while efficiently learning novel setups, such as a new environment or camera configuration. In particular, inspired by ControlNet, we fine-tune an NFM by attaching a trainable copy of the pretrained backbone via zero-initialized residual pathways, thereby learning geometric cues. This design enables the model to efficiently acquire in-domain geometry while preserving the pretrained knowledge of diverse behaviors. Despite its simplicity, our comprehensive evaluation in real-world navigation shows that our method enables robust long-horizon navigation with minimal collisions and human intervention.
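The ControlNet-inspired design described in the abstract can be illustrated with a minimal sketch: a frozen pretrained backbone preserves the action-distribution prior, a trainable copy of the backbone receives the geometric cue, and zero-initialized residual pathways ensure the fine-tuned model initially reproduces the pretrained output exactly. The sketch below is not the authors' implementation; the class name, dimensions, and the MLP backbone are hypothetical placeholders, assuming a PyTorch backbone that maps flat visual features to policy features.

```python
import copy
import torch
import torch.nn as nn


def zero_module(module: nn.Module) -> nn.Module:
    """Zero-initialize all parameters so the residual branch starts as a no-op."""
    for p in module.parameters():
        nn.init.zeros_(p)
    return module


class ControlNetStyleAdapter(nn.Module):
    """Hypothetical ControlNet-style wrapper around a pretrained navigation backbone."""

    def __init__(self, pretrained_backbone: nn.Module, in_dim: int, feature_dim: int):
        super().__init__()
        # Trainable copy of the backbone: learns in-domain geometric cues.
        self.trainable = copy.deepcopy(pretrained_backbone)

        # Frozen pretrained backbone: preserves the action-distribution prior.
        self.frozen = pretrained_backbone
        for p in self.frozen.parameters():
            p.requires_grad_(False)

        # Zero-initialized pathways: the geometric cue enters the trainable copy,
        # and its output re-enters the frozen stream; both start at zero, so the
        # fine-tuned model initially matches the pretrained model exactly.
        self.cond_in = zero_module(nn.Linear(in_dim, in_dim))
        self.zero_proj = zero_module(nn.Linear(feature_dim, feature_dim))

    def forward(self, obs: torch.Tensor, geom_cue: torch.Tensor) -> torch.Tensor:
        # obs: visual observation features; geom_cue: geometric cue (e.g. estimated
        # depth features), assumed here to share the observation's flat shape.
        base = self.frozen(obs)
        ctrl = self.trainable(obs + self.cond_in(geom_cue))
        return base + self.zero_proj(ctrl)


# Usage with a hypothetical 2-layer MLP backbone over 512-d visual features.
backbone = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 256))
model = ControlNetStyleAdapter(backbone, in_dim=512, feature_dim=256)
policy_feat = model(torch.randn(4, 512), torch.randn(4, 512))  # shape (4, 256)
```

Under this assumed setup, only the copied backbone and the two zero-initialized projections receive gradients, which is what keeps the fine-tuning from eroding the pretrained prior while still letting the model absorb in-domain geometry.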
Supplementary Material: zip
Submission Number: 11