Trimodal Navigable Region Segmentation Model: Grounding Navigation Instructions in Urban Areas

Published: 01 Jan 2024, Last Modified: 08 Jun 2024. IEEE Robotics and Automation Letters, 2024. License: CC BY-SA 4.0.
Abstract: In this study, we develop a model that enables mobility systems to interact with users in a more user-friendly manner. Specifically, we focus on the referring navigable regions task, in which a model grounds navigable regions of the road given the mobility's camera image and a natural language navigation instruction. This task is challenging because it requires vision-and-language comprehension in rapidly changing environments shared with other mobilities. The performance of existing methods is insufficient, partly because they do not consider scene-context features such as semantic segmentation information, so incorporating these features into a multimodal encoder is important. We therefore propose a trimodal (language, image, and mask) encoder-decoder model, the Trimodal Navigable Region Segmentation Model. We introduce a Text-Mask Encoder Block to process semantic segmentation masks and a Day-Night Classification Branch to balance the input modalities. We validated our model on the Talk2Car-RegSeg dataset, and the results demonstrate that our method outperforms the baseline method on standard metrics.
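To make the described architecture concrete, below is a minimal PyTorch sketch of a trimodal encoder-decoder of this kind: it fuses language, image, and semantic-segmentation-mask features and attaches an auxiliary day/night classification head. All module choices, layer sizes, and names here are illustrative assumptions, not the authors' implementation of the Text-Mask Encoder Block or the Day-Night Classification Branch.

```python
# Minimal sketch (not the authors' code): a trimodal encoder-decoder that fuses
# language, image, and semantic-segmentation-mask features, with an auxiliary
# day/night classification head. All names and sizes are illustrative.
import torch
import torch.nn as nn


class TrimodalSegmenter(nn.Module):
    def __init__(self, vocab_size=10000, d_model=256, n_classes=1):
        super().__init__()
        # Language encoder: token embedding followed by a small Transformer.
        self.embed = nn.Embedding(vocab_size, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.text_enc = nn.TransformerEncoder(enc_layer, num_layers=2)
        # Image encoder: a shallow CNN standing in for a pretrained backbone.
        self.img_enc = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, d_model, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Mask encoder: processes a semantic segmentation map (1 channel here).
        self.mask_enc = nn.Sequential(
            nn.Conv2d(1, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, d_model, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Text-conditioned fusion of the three modalities (1x1 conv over concat).
        self.fuse = nn.Conv2d(3 * d_model, d_model, 1)
        # Decoder: upsample fused features to a navigable-region mask.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(d_model, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, n_classes, 4, stride=2, padding=1),
        )
        # Auxiliary day/night classifier on pooled image features.
        self.day_night = nn.Linear(d_model, 2)

    def forward(self, tokens, image, seg_mask):
        txt = self.text_enc(self.embed(tokens)).mean(dim=1)       # (B, d)
        img = self.img_enc(image)                                  # (B, d, H/4, W/4)
        msk = self.mask_enc(seg_mask)                              # (B, d, H/4, W/4)
        b, d, h, w = img.shape
        txt_map = txt[:, :, None, None].expand(b, d, h, w)         # broadcast text
        fused = self.fuse(torch.cat([img, msk, txt_map], dim=1))
        region_logits = self.decoder(fused)                        # (B, 1, H, W)
        dn_logits = self.day_night(img.mean(dim=(2, 3)))           # (B, 2)
        return region_logits, dn_logits


if __name__ == "__main__":
    model = TrimodalSegmenter()
    tokens = torch.randint(0, 10000, (2, 12))
    image = torch.randn(2, 3, 128, 128)
    seg_mask = torch.randn(2, 1, 128, 128)
    region, dn = model(tokens, image, seg_mask)
    print(region.shape, dn.shape)  # (2, 1, 128, 128) and (2, 2)
```

In this sketch the navigable-region logits would be trained with a per-pixel segmentation loss while the day/night logits serve as an auxiliary objective; the real model's fusion blocks and training losses may differ.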