RILA: Reflective and Imaginative Language Agent for Zero-Shot Semantic Audio-Visual Navigation

Published: 01 Jan 2024, Last Modified: 14 Nov 2024, CVPR 2024, CC BY-SA 4.0
Abstract: We leverage Large Language Models (LLMs) for zero-shot Semantic Audio-Visual Navigation (SAVN). Existing methods rely on extensive training demonstrations for reinforcement learning, yet achieve relatively low success rates and lack generalizability. The intermittent nature of auditory signals poses additional obstacles to inferring the goal information. To address these challenges, we present the Reflective and Imaginative Language Agent (RILA). By employing multi-modal models to process sensory data, we instruct an LLM-based planner to actively explore the environment. During exploration, our agent adaptively evaluates and dismisses inaccurate perceptual descriptions. Additionally, we introduce an auxiliary LLM-based assistant that enhances global environmental comprehension by mapping room layouts and providing strategic insights. Through comprehensive experiments and analysis, we show that our method outperforms relevant baselines without requiring training demonstrations from the environment or complementary semantic information.
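The abstract describes an agent loop with three roles: multi-modal perception, a reflective filter over perceptual descriptions, and an imaginative assistant that advises an LLM-based planner. The minimal Python sketch below illustrates one way such a loop could be organized; every name here (`RILAAgent`, `Percept`, `llm`, `reflect`, `imagine`, `plan`), the confidence threshold, and the stubbed LLM call are assumptions of this sketch, not the authors' implementation.

```python
from dataclasses import dataclass, field

def llm(prompt: str) -> str:
    """Stand-in for an LLM call (hypothetical); returns a canned reply here."""
    return "move_to: unexplored_frontier_1"

@dataclass
class Percept:
    description: str   # caption from a multi-modal perception model (assumed)
    confidence: float  # self-reported reliability score (assumed)

@dataclass
class RILAAgent:
    memory: list[str] = field(default_factory=list)

    def reflect(self, percepts: list[Percept], threshold: float = 0.5) -> list[Percept]:
        # Reflective step: dismiss perceptual descriptions judged inaccurate.
        return [p for p in percepts if p.confidence >= threshold]

    def imagine(self) -> str:
        # Imaginative assistant: summarize observations into a room-layout
        # map plus strategic advice for the planner.
        return llm("Summarize the room layout and suggest where the sounding "
                   f"object might be, given: {self.memory}")

    def plan(self, percepts: list[Percept]) -> str:
        kept = self.reflect(percepts)
        self.memory.extend(p.description for p in kept)
        advice = self.imagine()
        # LLM-based planner chooses the next exploration waypoint zero-shot.
        return llm(f"Observations: {[p.description for p in kept]}\n"
                   f"Assistant advice: {advice}\nChoose the next waypoint.")

agent = RILAAgent()
step = agent.plan([Percept("a couch near a doorway", 0.9),
                   Percept("likely hallucinated piano", 0.2)])
print(step)  # e.g. "move_to: unexplored_frontier_1"
```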