Learning multimodal adaptive relation graph and action boost memory for visual navigation

Published: 15 Jul 2024, Last Modified: 08 Jun 2025, Advanced Engineering Informatics, CC BY-NC-ND 4.0
Abstract: The task of visual navigation (VN) is to steer an agent toward a target object using only visual perception. Previous works largely exploit multimodal information (e.g., visual observations and training memory) to improve environmental perception, while making less effort to leverage the information exchanged between modalities. Moreover, multimodal fusion tends to ignore data dependencies (favoring only part of the modal data) as well as the supervision of actions. In this work, we present a novel multimodal graph learning (MGL) structure for VN, which consists of three parts. (1) The multimodal fusion module exploits rich spatial, RGB, and depth information about objects’ locations, as well as semantic information about their categories. (2) An adaptive relation graph (ARG) is dynamically built using object detectors; it encodes the multimodal fusion and adapts to novel environments. The graph embeds the agent’s navigation history and other useful task-oriented structural information, enabling the agent to form associations and make informed decisions. (3) The action boost module (ABM) assists the agent in making intelligent decisions, predicting more accurate actions from beneficial training experience. Our agent can foresee what the goal state may look like and how to get closer to that state; this combination of the “what” and the “how” allows the agent to navigate to the target object effectively. We validate our approach on the AI2-THOR dataset, where it achieves 24.2\% and 23.7\% improvements in SPL (Success weighted by Path Length) and SR (Success Rate) over the baselines, respectively. Code and datasets can be found at \texttt{https://github.com/luosword/ABM_VN}.
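To make the three-part structure described above concrete, the following is a minimal PyTorch-style sketch, not the authors' released code (see the repository linked above for that). All module names, feature dimensions, the graph-construction rule, and the memory mechanism here are illustrative assumptions: a fusion layer over per-object spatial/RGB/depth/semantic features, one round of message passing over an adaptively weighted relation graph, and a memory lookup that biases the action prediction.

```python
# Illustrative sketch only (assumed shapes and modules, not the paper's implementation).
import torch
import torch.nn as nn


class MultimodalFusion(nn.Module):
    """Fuse per-object spatial, RGB, depth, and semantic features into one embedding."""
    def __init__(self, dims=(4, 512, 64, 300), hidden=256):
        super().__init__()
        self.proj = nn.ModuleList([nn.Linear(d, hidden) for d in dims])
        self.mix = nn.Linear(hidden * len(dims), hidden)

    def forward(self, feats):  # feats: list of (num_objects, dim) tensors
        fused = torch.cat([p(f) for p, f in zip(self.proj, feats)], dim=-1)
        return torch.relu(self.mix(fused))            # (num_objects, hidden)


class AdaptiveRelationGraph(nn.Module):
    """One round of message passing over adaptively weighted object relations."""
    def __init__(self, hidden=256):
        super().__init__()
        self.rel = nn.Linear(hidden, hidden)
        self.upd = nn.GRUCell(hidden, hidden)

    def forward(self, nodes):                          # nodes: (num_objects, hidden)
        adj = torch.softmax(nodes @ nodes.t(), dim=-1)  # similarity-based relation weights
        msg = adj @ self.rel(nodes)
        return self.upd(msg, nodes)                    # updated node embeddings


class ActionBoostMemory(nn.Module):
    """Retrieve helpful past experience from a memory bank to bias action logits."""
    def __init__(self, hidden=256, n_actions=6, slots=128):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(slots, hidden))
        self.policy = nn.Linear(hidden * 2, n_actions)

    def forward(self, state):                          # state: (hidden,)
        attn = torch.softmax(self.memory @ state, dim=0)
        recall = attn @ self.memory                    # experience summary
        return self.policy(torch.cat([state, recall])) # (n_actions,)


# Toy forward pass with random features for 5 detected objects.
fusion, arg, abm = MultimodalFusion(), AdaptiveRelationGraph(), ActionBoostMemory()
feats = [torch.randn(5, d) for d in (4, 512, 64, 300)]
nodes = arg(fusion(feats))
logits = abm(nodes.mean(dim=0))
print(logits.shape)  # torch.Size([6])
```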