EfficientNav: Towards On-Device Vision-Language Navigation with Navigation Map Caching and Retrieval
Abstract: Embodied agents equipped with large language models (LLMs) and online-constructed navigation maps have demonstrated promising capability for zero-shot vision-language navigation (VLN) in unseen environments. However, existing agents rely heavily on giant cloud-hosted LLMs, e.g., GPT-4, while directly switching to small LLMs, e.g., LLaMA3.2-11b, causes a significant drop in success rate due to their limited capacity for understanding complex navigation maps, which prevents deploying VLN on local devices.
At the same time, the long prompts introduced by the navigation map description cause high planning latency on local devices.
In this paper, we propose EfficientNav to enable on-device zero-shot VLN for the first time. To help smaller LLMs better understand the environment, we propose semantics-aware memory retrieval to prune redundant information from the navigation map.
To reduce planning latency, we propose discrete memory caching and attention-based memory clustering to efficiently save and re-use the KV cache.
Extensive experimental results demonstrate that EfficientNav achieves an 11.1\% improvement in success rate on the Habitat ObjNav Challenge benchmark over GPT-4-based baselines,
and delivers a 6.7$\times$ real-time latency reduction and a 4.7$\times$ end-to-end latency reduction over the GPT-4 planner. Our code is available on Anonymous GitHub.
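The abstract's discrete memory caching idea (saving and re-using KV cache for groups of navigation-map nodes) can be illustrated with a minimal sketch. This is not the paper's implementation: the cluster ids, the `encode_prefix` prefill call, and the `decode_with_kv` decoding call below are hypothetical placeholders standing in for the on-device LLM runtime.

```python
# Illustrative sketch only: cache per-cluster KV states for navigation-map
# descriptions so a small on-device LLM does not re-prefill unchanged map
# regions at every planning step. All model-facing callables are assumed.

from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class MemoryCache:
    """Maps a map-node cluster id to its serialized description and cached KV state."""
    kv_store: Dict[int, Tuple[str, object]] = field(default_factory=dict)

    def get_or_build(self, cluster_id: int, description: str,
                     encode_prefix: Callable[[str], object]) -> object:
        # Reuse the cached KV state if this cluster was already prefilled
        # and its description has not changed.
        cached = self.kv_store.get(cluster_id)
        if cached is not None and cached[0] == description:
            return cached[1]
        # Otherwise run prefill once for this cluster and cache the result.
        kv_state = encode_prefix(description)
        self.kv_store[cluster_id] = (description, kv_state)
        return kv_state


def plan_step(goal: str,
              retrieved_clusters: List[Tuple[int, str]],
              cache: MemoryCache,
              encode_prefix: Callable[[str], object],
              decode_with_kv: Callable[..., str]) -> str:
    """Assemble KV states for the retrieved map clusters, then decode a plan."""
    kv_states = [cache.get_or_build(cid, desc, encode_prefix)
                 for cid, desc in retrieved_clusters]
    # `decode_with_kv` stands in for an LLM call that consumes the cached KV
    # states plus a short, goal-specific suffix prompt.
    return decode_with_kv(kv_states, suffix=f"Goal: {goal}. Next subgoal?")
```

In this sketch, only clusters that are new or whose descriptions changed trigger prefill, which is the kind of saving the claimed latency reduction relies on; the actual retrieval and clustering policies are described in the paper.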
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: Vision-language Navigation, Navigation Map, Memory Caching, Memory Retrieval
Contribution Types: Approaches to low-resource settings, Approaches low compute settings-efficiency
Languages Studied: English
Submission Number: 3993