Track: full paper
Keywords: Visual Language Models (VLMs), Multimodal content understanding, Language-guided aerial navigation, Landmark-based navigation, Map-based navigation
Abstract: Visual Language Models (VLMs) have shown impressive abilities in understanding and generating multimodal content by integrating visual and textual information. Recently, language-guided aerial navigation benchmarks have emerged, presenting a novel challenge for VLMs. In this work, we focus on the use of navigation maps, a critical component of the broader aerial navigation problem. We analyze the CityNav benchmark, a recently introduced dataset for language-goal aerial navigation that incorporates navigation maps and 3D point clouds of real cities to simulate environments for drones. We demonstrate that existing open-source VLMs perform poorly at understanding navigation maps in a zero-shot setting. To address this, we fine-tune one of the top-performing VLMs, Qwen2-VL, on map data, achieving near-perfect performance on a landmark-based navigation task. Notably, our fine-tuned Qwen2-VL model, using only the landmark map, achieves performance on par with the best baseline model in the CityNav benchmark. This highlights the potential of leveraging navigation maps to enhance VLM capabilities in aerial navigation tasks.
Presenter: ~Tigran_Galstyan1
Format: Maybe: the presenting author will attend in person, contingent on other factors that still need to be determined (e.g., visa, funding).
Funding: No, the presenting author of this submission does *not* fall under ICLR’s funding aims, or has sufficient alternate funding.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 72