Lang2LTL-2: Grounding Spatiotemporal Navigation Commands Using Large Language and Vision-Language Models

Published: 24 Oct 2024, Last Modified: 06 Nov 2024 · LEAP 2024 Poster · CC BY 4.0
Keywords: robotics language grounding, LLM, VLM, formal methods
TL;DR: Grounding spatiotemporal navigation commands using large language and vision-language models in novel indoor and outdoor environments without retraining on language data
Abstract: Grounding spatiotemporal navigation commands to structured task specifications enables autonomous robots to understand a broad range of natural language and solve long-horizon tasks with safety guarantees. Prior work has mostly focused on grounding either spatial or temporally extended language for robots. We propose Lang2LTL-2, a modular system that leverages pretrained large language and vision-language models together with multimodal semantic information to ground spatiotemporal navigation commands in novel city-scaled environments without retraining. Lang2LTL-2 achieves 93.53% language grounding accuracy on a dataset of 21,780 semantically diverse natural language commands in unseen environments. An ablation study validates the need for each input modality. We also show that a physical robot equipped with the same system, without modification, can execute 50 semantically diverse natural language commands in both indoor and outdoor environments.
Submission Number: 2
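
The abstract describes a modular pipeline that uses pretrained language models to translate natural-language navigation commands into temporal-logic task specifications grounded in environment landmarks. The sketch below is a minimal, illustrative toy version of such a pipeline under stated assumptions, not the authors' implementation: `query_llm`, `fake_llm`, and the landmark table are hypothetical placeholders, and the grounding step uses exact lookup where Lang2LTL-2 uses multimodal semantic information.

```python
# Minimal, illustrative sketch of a modular command-grounding pipeline in the
# spirit of Lang2LTL-2 (NOT the authors' implementation). `query_llm` is a
# hypothetical stand-in for any pretrained LLM call; the landmark map is a toy.

from typing import Callable, Dict, List


def extract_referring_expressions(command: str, query_llm: Callable[[str], str]) -> List[str]:
    """Ask the LLM to list landmark referring expressions in the command."""
    prompt = (
        "List the landmark referring expressions in this navigation command, "
        f"one per line:\n{command}"
    )
    return [line.strip() for line in query_llm(prompt).splitlines() if line.strip()]


def ground_expressions(expressions: List[str], landmark_db: Dict[str, str]) -> Dict[str, str]:
    """Map each referring expression to a known landmark symbol.
    A real system would compare multimodal (text + image) semantic embeddings;
    exact lookup here is only a placeholder."""
    return {exp: landmark_db.get(exp.lower(), "unknown") for exp in expressions}


def lift_to_ltl(command: str, grounding: Dict[str, str], query_llm: Callable[[str], str]) -> str:
    """Ask the LLM to translate the command into an LTL formula over grounded symbols."""
    symbols = ", ".join(f'"{exp}" -> {sym}' for exp, sym in grounding.items())
    prompt = (
        "Translate this navigation command into an LTL formula using the given "
        f"propositions.\nCommand: {command}\nPropositions: {symbols}\nLTL:"
    )
    return query_llm(prompt).strip()


if __name__ == "__main__":
    # Toy LLM stub so the sketch runs end to end without network access.
    def fake_llm(prompt: str) -> str:
        if prompt.startswith("List"):
            return "the red bookstore\nthe fountain"
        return "F (a & F b)"  # "eventually a, then eventually b"

    landmarks = {"the red bookstore": "a", "the fountain": "b"}
    cmd = "Go to the red bookstore, then visit the fountain."
    exps = extract_referring_expressions(cmd, fake_llm)
    grounding = ground_expressions(exps, landmarks)
    print(lift_to_ltl(cmd, grounding, fake_llm))
```

The resulting LTL formula could then be handed to an off-the-shelf planner or model checker for execution with the safety guarantees the abstract mentions; that downstream step is outside the scope of this sketch.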