Keywords: Autonomous driving, Vision and Language, Semantic Understanding
TL;DR: We propose GENNAV, a polygon-based segmentation method that predicts whether target regions exist and generates segmentation masks for stuff-type target regions.
Abstract: We focus on the task of identifying the location of target regions from a natural language instruction and an image captured by the front camera of a moving vehicle.
This task is challenging because it requires both existence prediction and segmentation mask generation, particularly for stuff-type target regions with ambiguous boundaries.
Existing methods often underperform on stuff-type target regions, as well as in cases where the target is absent or multiple targets are present.
To overcome these limitations, we propose GENNAV, which predicts target existence and generates segmentation masks for multiple stuff-type target regions.
To evaluate GENNAV, we constructed a novel benchmark called GRiN-Drive, which includes three distinct types of samples: no-target, single-target, and multi-target.
GENNAV achieved superior performance over baseline methods on standard evaluation metrics.
Furthermore, we conducted real-world experiments with four automobiles operated in five geographically distinct urban areas to validate its zero-shot transfer performance.
In these experiments, GENNAV outperformed baseline methods and demonstrated its robustness across diverse real-world environments.
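The abstract describes GENNAV as polygon-based: rather than predicting a per-pixel mask directly, each stuff-type target region can be represented as a polygon and then rasterized into a binary mask. The sketch below illustrates that final rasterization step only, using a standard ray-casting point-in-polygon test; the function names and representation are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of polygon-to-mask rasterization, the last step a
# polygon-based segmentation method would need. Not the authors' code.

def point_in_polygon(x, y, poly):
    """Ray-casting test: is point (x, y) inside the polygon `poly`
    (a list of (x, y) vertices)?"""
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        # Count edge crossings of a horizontal ray extending right from (x, y).
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def polygon_to_mask(poly, width, height):
    """Rasterize one predicted polygon into a binary mask (list of rows),
    sampling each pixel at its center."""
    return [
        [1 if point_in_polygon(x + 0.5, y + 0.5, poly) else 0
         for x in range(width)]
        for y in range(height)
    ]

# Example: a triangular region on a 6x6 image grid.
mask = polygon_to_mask([(0, 0), (5, 0), (0, 5)], 6, 6)
```

In practice such a mask would be produced per predicted polygon, with an existence head deciding whether any polygon is emitted at all; with no target, the method would output no polygons, and with multiple targets, one polygon per region.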
Submission Number: 236