Walk and Read Less: Improving the Efficiency of Vision-and-Language Navigation via Tuning-Free Multimodal Token Pruning

ACL ARR 2025 February Submission 2743 Authors

15 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract: Large models have demonstrated state-of-the-art performance on Vision-and-Language Navigation (VLN) tasks, but their high computational cost limits deployment in hardware-constrained environments. Token pruning reduces computation by decreasing the size of navigation inputs, offering a promising solution. However, in VLN tasks, input pruning can lead to information loss, causing the agent to take longer paths before deciding when to stop, which increases computational demands and limits efficiency gains. Moreover, attention-based pruning of instructions often fails to discard non-critical words, misspending valuable token budget. To improve navigation efficiency and address these challenges, we prune the navigation input from three angles. First, we divide the panoramic views into action and background tokens, preserving the key information for action prediction while improving navigation efficiency by pruning the background views. Second, we prune nodes from the agent's navigation map to discourage backtracking and shorten paths. Finally, we leverage a Large Language Model to assess word importance in instructions, enabling us to accurately prune non-essential words. Experimental results show that our method significantly outperforms state-of-the-art pruning strategies in FLOPs efficiency while maintaining higher accuracy across diverse VLN models and datasets.
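To make the three pruning angles in the abstract concrete, the following is a minimal, illustrative Python sketch. It is not the authors' implementation: all function names (prune_views, prune_map_nodes, prune_instruction) and the scoring inputs are hypothetical placeholders standing in for the paper's action/background view split, navigation-map node pruning, and LLM-assigned word-importance scores.

```python
# Illustrative sketch only; names and scoring heuristics are assumptions,
# not the paper's released code.
from typing import Dict, List, Set


def prune_views(view_tokens: List[Dict], keep_background: int = 4) -> List[Dict]:
    """Keep all 'action' view tokens (navigable directions) and only the
    top-k 'background' view tokens ranked by a relevance score."""
    action = [v for v in view_tokens if v["is_action"]]
    background = [v for v in view_tokens if not v["is_action"]]
    background = sorted(background, key=lambda v: v["score"], reverse=True)
    return action + background[:keep_background]


def prune_map_nodes(visited: List[str], candidates: Set[str]) -> Set[str]:
    """Drop already-visited nodes from the candidate set so the agent is
    discouraged from backtracking and takes shorter paths."""
    return {node for node in candidates if node not in visited}


def prune_instruction(words: List[str], importance: Dict[str, float],
                      threshold: float = 0.5) -> List[str]:
    """Keep only words whose importance exceeds a threshold. Here the
    `importance` map is assumed to be produced offline by prompting an LLM."""
    return [w for w in words if importance.get(w, 0.0) >= threshold]


if __name__ == "__main__":
    views = [
        {"id": 0, "is_action": True,  "score": 0.9},
        {"id": 1, "is_action": False, "score": 0.2},
        {"id": 2, "is_action": False, "score": 0.7},
    ]
    print(prune_views(views, keep_background=1))
    print(prune_map_nodes(visited=["n1"], candidates={"n1", "n2", "n3"}))
    print(prune_instruction(
        ["walk", "to", "the", "red", "chair"],
        {"walk": 0.9, "to": 0.1, "the": 0.05, "red": 0.8, "chair": 0.95},
    ))
```

All three steps shrink the multimodal input before it reaches the navigation model, which is how the approach reduces FLOPs without retraining (tuning-free).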
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: vision language navigation, pruning
Contribution Types: Approaches for low compute settings-efficiency
Languages Studied: English
Submission Number: 2743