Aux-Think: Exploring Reasoning Strategies for Data-Efficient Vision-Language Navigation

Published: 18 Sept 2025 · Last Modified: 29 Oct 2025 · NeurIPS 2025 poster · CC BY 4.0
Keywords: vision-language navigation, chain of thought, data efficiency
Abstract: Vision-Language Navigation (VLN) is a critical task for developing embodied agents that follow natural language instructions to navigate complex real-world environments. Recent advances from fine-tuning large pretrained models have significantly improved generalization and instruction grounding over traditional approaches. However, the role of reasoning strategies in navigation (an action-centric, long-horizon task) remains underexplored, despite the demonstrated success of Chain-of-Thought (CoT) reasoning in static tasks such as question answering and visual reasoning. To address this gap, we conduct the first systematic evaluation of reasoning strategies for VLN, including No-Think (direct action prediction), Pre-Think (reason before action), and Post-Think (reason after action). Surprisingly, our findings reveal an Inference-time Reasoning Collapse issue, in which inference-time reasoning degrades navigation accuracy, highlighting the challenges of integrating reasoning into VLN. Based on this insight, we propose Aux-Think, a framework that trains models to internalize structured reasoning patterns through CoT supervision during training, while preserving No-Think inference for efficient action prediction. To support this framework, we release R2R-CoT-320k, a large-scale Chain-of-Thought annotated dataset. Empirically, Aux-Think significantly reduces training effort without compromising performance.
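The abstract describes Aux-Think as CoT supervision applied only during training, with plain No-Think action prediction at inference. The sketch below is not the authors' code; it is a minimal illustration, under the assumption that reasoning supervision enters as an auxiliary loss. The policy interface (predict_action, generate_cot_logits), batch fields, and the weight lambda_cot are hypothetical names introduced here for clarity.

```python
import torch
import torch.nn.functional as F


def training_step(policy, batch, lambda_cot=0.5):
    """One training step with an auxiliary CoT loss (illustrative sketch)."""
    # Direct action prediction loss (the No-Think objective).
    action_logits = policy.predict_action(batch["obs"], batch["instruction"])
    loss_action = F.cross_entropy(action_logits, batch["action"])

    # Auxiliary CoT loss: supervise generation of the annotated reasoning
    # trace (e.g., from a dataset such as R2R-CoT-320k) at the token level.
    cot_logits = policy.generate_cot_logits(batch["obs"], batch["instruction"])
    loss_cot = F.cross_entropy(
        cot_logits.view(-1, cot_logits.size(-1)),
        batch["cot_tokens"].view(-1),
        ignore_index=-100,  # mask out padding tokens
    )

    # Reasoning is learned as a side objective; it never gates the action head.
    return loss_action + lambda_cot * loss_cot


@torch.no_grad()
def act(policy, obs, instruction):
    """Inference stays No-Think: no reasoning tokens are generated at test time."""
    return policy.predict_action(obs, instruction).argmax(dim=-1)
```

Keeping the CoT branch out of the inference path is what preserves No-Think efficiency while still letting the reasoning supervision shape the learned representations during training.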
Supplementary Material: zip
Primary Area: Applications (e.g., vision, language, speech and audio, Creative AI)
Submission Number: 6774