Harnessing Input-adaptive Inference for Efficient Vision-and-Language Navigation

26 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: Vision-and-Language Navigation, Input-adaptive Efficient Navigation
TL;DR: We present a novel input-adaptive inference method for efficient vision-and-language navigation.
Abstract: An emerging paradigm in vision-and-language navigation (VLN) is the use of history-aware multi-modal transformer models. Given a language instruction, these models take the agent's current observation and navigation history as input and predict the most appropriate next action. While these models have significantly improved task performance, their scale can become a bottleneck in practical settings where computational resources are limited (e.g., on robots). In this work, we present a novel input-adaptive navigation method for efficient VLN. We first characterize the overthinking problem in VLN and show that no existing input-adaptive mechanism reduces overthinking without causing significant performance degradation. Our method addresses this problem with three adaptive algorithms deployed at different levels: (1) For spatial efficiency, we develop an adaptive approach that processes only a subset of the panoramic views at each observation. (2) For model-level efficiency, we develop adaptive thresholding for the early-exit method we employ, setting each view's exit threshold according to its importance to navigation. (3) For temporal efficiency, we design a caching mechanism that avoids re-processing views the agent has seen before. In evaluations on six VLN benchmark tasks, we demonstrate over a 2$\times$ reduction in computation across two off-the-shelf VLN agents.
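To make the three mechanisms concrete, below is a minimal, hypothetical PyTorch sketch. The `ViewEncoder`, the importance scores, and the threshold schedule are illustrative assumptions for exposition, not the architecture or hyperparameters from the paper.

```python
# Hypothetical sketch of the three adaptive mechanisms described in the
# abstract: (1) view subsetting, (2) importance-based early-exit thresholds,
# and (3) caching of previously seen views. Names here are illustrative.
import torch
import torch.nn as nn


class ViewEncoder(nn.Module):
    """A stack of blocks with an exit head after each block."""

    def __init__(self, dim=64, depth=4):
        super().__init__()
        self.blocks = nn.ModuleList(nn.Linear(dim, dim) for _ in range(depth))
        self.exit_heads = nn.ModuleList(nn.Linear(dim, 1) for _ in range(depth))

    def forward(self, x, exit_threshold):
        # (2) Early exit: stop once an exit head is confident enough.
        # A higher threshold means the view is processed by more blocks.
        for block, head in zip(self.blocks, self.exit_heads):
            x = torch.relu(block(x))
            confidence = torch.sigmoid(head(x)).item()
            if confidence >= exit_threshold:
                break
        return x


def encode_panorama(views, importance, encoder, cache, k=3, base_threshold=0.9):
    """views: dict view_id -> feature tensor; importance: dict view_id -> score in [0, 1]."""
    # (1) Spatial efficiency: process only the k most important views.
    selected = sorted(views, key=lambda v: importance[v], reverse=True)[:k]
    outputs = {}
    for view_id in selected:
        # (3) Temporal efficiency: reuse features of previously seen views.
        if view_id in cache:
            outputs[view_id] = cache[view_id]
            continue
        # (2) Adaptive thresholding: more important views get a stricter
        # exit threshold, and therefore more computation.
        threshold = base_threshold + (1 - base_threshold) * importance[view_id]
        with torch.no_grad():
            outputs[view_id] = encoder(views[view_id], threshold)
        cache[view_id] = outputs[view_id]
    return outputs


if __name__ == "__main__":
    torch.manual_seed(0)
    encoder = ViewEncoder()
    cache = {}
    views = {i: torch.randn(1, 64) for i in range(6)}       # 6 panoramic views
    importance = {i: torch.rand(1).item() for i in range(6)}
    feats = encode_panorama(views, importance, encoder, cache)
    print(f"processed {len(feats)} of {len(views)} views; cache size {len(cache)}")
```

In this sketch, a more important view receives a stricter exit threshold (and thus more layers of computation), while views the agent has already seen are served from the cache at no extra cost; the actual scoring and exit criteria in the paper may differ.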
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 8029