Distilling auxiliary RGB-T features for unsupervised semantic segmentation

Meena S Padnekar, Sukhendu Das

Published: 04 Apr 2026, Last Modified: 01 May 2026, Image and Vision Computing (Elsevier), Everyone, CC BY 4.0

Abstract: Unsupervised Semantic Segmentation (USS) aims to categorize image pixels into semantic groups without relying on annotated data. Existing USS methods predominantly operate on RGB images and exploit self-supervised Vision Transformers (ViTs) to model semantic correlations, but their performance degrades severely under adverse illumination because of the inherent limitations of the RGB modality. To address this challenge, we propose DARTS, a novel multimodal framework that leverages complementary information from the thermal spectrum alongside RGB inputs for unsupervised semantic segmentation. Observing that self-supervised ViTs produce semantically consistent feature structures across modalities, we design a multimodal feature fusion module equipped with a feature-correlation loss to learn clusterable and illumination-invariant representations from RGB–thermal pairs. The fusion module integrates self- and cross-attention within a single dual-modal ViT block to selectively extract complementary features, followed by a linear fusion mechanism for joint representation learning. To guide the unsupervised training, we introduce intra- and inter-modal feature correlation losses that contrast and distill features within and across modalities, encouraging compact and semantically meaningful pixel embeddings. DARTS can be seamlessly integrated into existing USS pipelines such as STEGO, SmooSeg, EAGLE, and DepthG, consistently enhancing their segmentation quality under challenging illumination conditions. Extensive experiments on the KP, PST900, MFNet, and SemanticRT datasets demonstrate that DARTS outperforms unimodal baselines, particularly in nighttime, glare, and low-visibility scenarios.
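
The fusion step described in the abstract (self- and cross-attention inside one dual-modal block, followed by a linear fusion) can be illustrated with a minimal NumPy sketch. This is not the DARTS implementation: the abstract gives no layer details, so the single-head attention without learned projections, the residual sum, the random fusion weights `w_fuse`, and all function names are assumptions made for illustration. The cosine-correlation matrices at the end are the kind of quantity on which intra- and inter-modal feature correlation losses could operate; the losses themselves are not reproduced here.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q_src, kv_src):
    """Single-head scaled dot-product attention.
    Assumption: no learned Q/K/V projections, to keep the sketch short."""
    d = q_src.shape[-1]
    weights = softmax(q_src @ kv_src.T / np.sqrt(d))
    return weights @ kv_src

def fuse(rgb, thermal, w_fuse):
    """Hypothetical dual-modal block: each modality attends to itself
    (self-attention) and to the other modality (cross-attention); the two
    streams are then concatenated and linearly fused."""
    rgb_out = rgb + attention(rgb, rgb) + attention(rgb, thermal)
    th_out = thermal + attention(thermal, thermal) + attention(thermal, rgb)
    return np.concatenate([rgb_out, th_out], axis=-1) @ w_fuse

def cosine_correlation(a, b):
    """Pairwise cosine similarity between two sets of token features."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

# Toy inputs standing in for ViT patch features (assumed sizes).
rng = np.random.default_rng(0)
n_tokens, dim = 16, 8
rgb = rng.normal(size=(n_tokens, dim))
thermal = rng.normal(size=(n_tokens, dim))
w_fuse = rng.normal(size=(2 * dim, dim)) / np.sqrt(2 * dim)

fused = fuse(rgb, thermal, w_fuse)        # joint representation per token
intra = cosine_correlation(fused, fused)  # within the fused features
inter = cosine_correlation(rgb, thermal)  # across the two modalities
```

In this toy setup, `intra` is a symmetric token-by-token similarity matrix with ones on its diagonal, while `inter` compares every RGB token with every thermal token. A training objective would presumably encourage these two structures to agree, which is one way to read "contrast and distill features within and across modalities".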