Abstract: Unsupervised Semantic Segmentation (USS) aims to categorize image pixels into semantic groups without
relying on annotated data. While existing USS methods predominantly operate on RGB images and exploit
self-supervised Vision Transformers (ViTs) to model semantic correlations, their performance degrades severely
under adverse illumination due to the inherent limitations of the RGB modality. To address this challenge, we
propose DARTS, a novel multimodal framework that leverages complementary information from the thermal
spectrum alongside RGB inputs for unsupervised semantic segmentation. Observing that self-supervised ViTs
produce semantically consistent feature structures across modalities, we design a multimodal feature fusion
module equipped with a feature-correlation loss to learn clusterable and illumination-invariant representations
from RGB–thermal pairs. The fusion module integrates self- and cross-attention within a single dual-modal
ViT block to selectively extract complementary features, followed by a linear fusion mechanism for joint
representation learning. To guide the unsupervised training, we introduce intra- and inter-modal feature
correlation losses that contrast and distill features within and across modalities, encouraging compact and
semantically meaningful pixel embeddings. DARTS can be seamlessly integrated into existing USS pipelines
such as STEGO, SmooSeg, EAGLE, and DepthG, consistently enhancing their segmentation quality under
challenging illumination conditions. Extensive experiments on the KP, PST900, MFNet, and SemanticRT datasets
demonstrate that DARTS outperforms unimodal baselines, particularly in nighttime, glare, and low-visibility
scenes.
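
As a rough illustration of the intra- and inter-modal feature-correlation idea described above, the following minimal sketch computes pairwise cosine correlations between toy per-pixel features and scores a set of embeddings against them. It assumes a STEGO-style objective of the form -(F - b) * max(S, 0), where F is a backbone feature correlation and S is the correlation of the learned embeddings. The function names, the shift value `b`, and the toy vectors are hypothetical and are not taken from the DARTS implementation.

```python
import math


def cosine_correlation(feats_a, feats_b):
    """Pairwise cosine similarity between two lists of feature vectors."""
    def normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0  # guard against zero vectors
        return [x / norm for x in vec]

    a = [normalize(v) for v in feats_a]
    b = [normalize(v) for v in feats_b]
    return [[sum(x * y for x, y in zip(u, w)) for w in b] for u in a]


def correlation_loss(backbone_corr, embed_corr, shift=0.0):
    """Assumed STEGO-style objective: mean of -(F - shift) * max(S, 0)."""
    total, count = 0.0, 0
    for f_row, s_row in zip(backbone_corr, embed_corr):
        for f, s in zip(f_row, s_row):
            total += -(f - shift) * max(s, 0.0)
            count += 1
    return total / count


# Toy per-pixel features for two pixels (hypothetical values).
rgb_feats = [[1.0, 0.0], [0.0, 1.0]]
thermal_feats = [[1.0, 0.0], [0.0, 1.0]]

# Intra-modal (RGB vs. RGB) and inter-modal (RGB vs. thermal) correlations.
intra_corr = cosine_correlation(rgb_feats, rgb_feats)
inter_corr = cosine_correlation(rgb_feats, thermal_feats)

# Embeddings that separate the two pixels vs. embeddings that collapse them.
separated = [[1.0, 0.0], [0.0, 1.0]]
collapsed = [[1.0, 0.0], [1.0, 0.0]]

loss_separated = (
    correlation_loss(intra_corr, cosine_correlation(separated, separated), shift=0.5)
    + correlation_loss(inter_corr, cosine_correlation(separated, separated), shift=0.5)
)
loss_collapsed = (
    correlation_loss(intra_corr, cosine_correlation(collapsed, collapsed), shift=0.5)
    + correlation_loss(inter_corr, cosine_correlation(collapsed, collapsed), shift=0.5)
)
```

In this toy setting, the embeddings that keep dissimilar pixels apart receive a lower loss than the collapsed ones, which is the behavior a correlation-based objective of this kind is meant to encourage. A real pipeline would operate on dense ViT feature maps and fused RGB-thermal embeddings rather than two-dimensional lists.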