TerraX: Visual Terrain Classification Enhanced by Vision-Language Models

Hongze Li, Xuchuan Huang, Xinhai Chang, Jun Zhou, Huijing Zhao

Published: 2025, Last Modified: 28 Feb 2026 · IROS 2025 · CC BY-SA 4.0
Abstract: Visual Terrain Classification (VTC) plays a vital role in enabling unmanned ground vehicles to understand complex environments. Existing research relies on image-label pairs annotated with static label sets, where semantic ambiguity and high annotation costs constrain fine-grained terrain characterization. These limitations hinder a model's adaptation to real-world terrain diversity and restrict its applicability. To address these issues, we propose TerraX, a vision-language learning framework that integrates multi-modal image-label-text data, unifying structured annotations with fine-grained natural language descriptions. The framework introduces a composite dataset, TerraData; an evaluation benchmark suite, TerraBench; and a CLIP-based visual terrain classification model, TerraCLIP. TerraData aggregates multi-source terrain images from public and self-collected datasets, annotated through a VLM-based vision-language data annotation pipeline. TerraBench defines three evaluation benchmarks to systematically assess model robustness and adaptability in real-world terrain classification scenarios. Built on the CLIP model, TerraCLIP utilizes a multi-granularity contrastive loss and LoRA fine-tuning to enhance understanding of terrain categories and attributes, and incorporates confidence-weighted inference for accurate predictions. Extensive experiments across benchmarks and real-world platforms demonstrate that our approach significantly enhances VTC performance, highlighting its potential for deployment in complex environments.
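The abstract does not spell out how confidence-weighted inference combines category labels with attribute descriptions. As a rough illustration only, the sketch below shows one plausible scheme for fusing two CLIP-style similarity distributions (one over category prompts, one over description prompts), weighting each by its own peak confidence; the function names, the temperature value, and the weighting rule are assumptions, not the paper's method.

```python
import numpy as np

def softmax(logits, temperature=0.07):
    """Numerically stable softmax; 0.07 mirrors CLIP's default logit scale."""
    z = logits / temperature
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def confidence_weighted_predict(img_emb, cat_embs, desc_embs):
    """Fuse category-prompt and description-prompt predictions.

    img_emb:   (d,)   L2-normalized image embedding
    cat_embs:  (K, d) L2-normalized text embeddings of category prompts
    desc_embs: (K, d) L2-normalized text embeddings of attribute descriptions
    Returns the predicted class index and the fused distribution.
    """
    # Cosine similarity reduces to a dot product for normalized embeddings.
    p_cat = softmax(cat_embs @ img_emb)
    p_desc = softmax(desc_embs @ img_emb)
    # Assumed heuristic: weight each branch by its peak probability,
    # so the more confident distribution dominates the fusion.
    w_cat, w_desc = p_cat.max(), p_desc.max()
    p = (w_cat * p_cat + w_desc * p_desc) / (w_cat + w_desc)
    return int(np.argmax(p)), p
```

In a real pipeline the embeddings would come from the (LoRA-fine-tuned) CLIP image and text encoders; here they are just normalized vectors, which keeps the fusion logic itself easy to inspect.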