EfficientNav: Towards On-Device Vision-Language Navigation with Navigation Map Caching and Retrieval
Abstract: Embodied agents equipped with large language models (LLMs) and online-constructed navigation maps have demonstrated promising capability for zero-shot vision-language navigation (VLN) in unseen environments. However, existing agents rely heavily on giant cloud-hosted LLMs, e.g., GPT-4, while directly switching to small LLMs, e.g., LLaMA3.2-11b, causes a significant drop in success rate due to their limited capacity for understanding complex navigation maps, which prevents deploying VLN on local devices.
At the same time, the long prompts introduced by the navigation map description cause high planning latency on local devices.
In this paper, we propose EfficientNav to enable on-device zero-shot VLN for the first time. To help smaller LLMs better understand the environment, we propose semantics-aware memory retrieval to prune redundant information from the navigation map.
To reduce planning latency, we propose discrete memory caching and attention-based memory clustering to efficiently save and re-use the KV cache.
Extensive experimental results demonstrate that EfficientNav achieves an 11.1\% improvement in success rate on the Habitat ObjNav Challenge benchmark over GPT-4-based baselines,
and delivers a 6.7$\times$ real-time latency reduction and a 4.7$\times$ end-to-end latency reduction over the GPT-4 planner. Our code is available on Anonymous GitHub.
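The abstract's discrete memory caching idea (saving and re-using KV cache for groups of navigation-map nodes) can be illustrated with a minimal sketch. This is not the paper's implementation: the cluster ids, the `encode_prefix` prefill call, and the `decode_with_kv` decoding call below are hypothetical placeholders standing in for the on-device LLM runtime.

```python
# Illustrative sketch only: cache per-cluster KV states for navigation-map
# descriptions so a small on-device LLM does not re-prefill unchanged map
# regions at every planning step. All model-facing callables are assumed.

from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class MemoryCache:
    """Maps a map-node cluster id to its serialized description and cached KV state."""
    kv_store: Dict[int, Tuple[str, object]] = field(default_factory=dict)

    def get_or_build(self, cluster_id: int, description: str,
                     encode_prefix: Callable[[str], object]) -> object:
        # Reuse the cached KV state if this cluster was already prefilled
        # and its description has not changed.
        cached = self.kv_store.get(cluster_id)
        if cached is not None and cached[0] == description:
            return cached[1]
        # Otherwise run prefill once for this cluster and cache the result.
        kv_state = encode_prefix(description)
        self.kv_store[cluster_id] = (description, kv_state)
        return kv_state


def plan_step(goal: str,
              retrieved_clusters: List[Tuple[int, str]],
              cache: MemoryCache,
              encode_prefix: Callable[[str], object],
              decode_with_kv: Callable[..., str]) -> str:
    """Assemble KV states for the retrieved map clusters, then decode a plan."""
    kv_states = [cache.get_or_build(cid, desc, encode_prefix)
                 for cid, desc in retrieved_clusters]
    # `decode_with_kv` stands in for an LLM call that consumes the cached KV
    # states plus a short, goal-specific suffix prompt.
    return decode_with_kv(kv_states, suffix=f"Goal: {goal}. Next subgoal?")
```

In this sketch, only clusters that are new or whose descriptions changed trigger prefill, which is the kind of saving the claimed latency reduction relies on; the actual retrieval and clustering policies are described in the paper.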
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: Vision-language Navigation, Navigation Map, Memory Caching, Memory Retrieval
Contribution Types: Approaches to low-resource settings, Approaches low compute settings-efficiency
Languages Studied: English
Submission Number: 3993