Track: full paper
Keywords: Visual Language Models (VLMs), Multimodal content understanding, Language-guided aerial navigation, Landmark-based navigation, Map-based navigation
Abstract: Visual Language Models (VLMs) have shown impressive abilities in understanding and generating multimodal content by integrating visual and textual information. Recently, language-guided aerial navigation benchmarks have emerged, presenting a novel challenge for VLMs. In this work, we focus on the use of navigation maps, a critical component of the broader aerial navigation problem. We analyze the CityNav benchmark, a recently introduced dataset for language-goal aerial navigation that incorporates navigation maps and 3D point clouds of real cities to simulate environments for drones. We demonstrate that existing open-source VLMs perform poorly at understanding navigation maps in a zero-shot setting. To address this, we fine-tune one of the top-performing VLMs, Qwen2-VL, on map data, achieving near-perfect performance on a landmark-based navigation task. Notably, our fine-tuned Qwen2-VL model, using only the landmark map, achieves performance on par with the best baseline model in the CityNav benchmark. This highlights the potential of leveraging navigation maps to enhance VLM capabilities in aerial navigation tasks.
Presenter: ~Tigran_Galstyan1
Format: Maybe: the presenting author will attend in person, contingent on other factors that still need to be determined (e.g., visa, funding).
Funding: No, the presenting author of this submission does *not* fall under ICLR’s funding aims, or has sufficient alternate funding.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 72