Abstract: Unsupervised domain adaptation (UDA) aims to improve a model’s performance on an unlabeled target domain by leveraging labeled data from a source domain. Traditional UDA methods often overlook the rich textual information inherent in class labels, limiting their effectiveness. Recent advances in vision–language models (VLMs), especially the contrastive language–image pretraining (CLIP) model, provide a promising multimodal foundation by integrating image and text representations. However, most CLIP-based UDA approaches rely on pseudo-labels generated in a zero-shot manner by the pretrained CLIP model, which may become suboptimal as training progresses. To address this limitation, we propose a pseudo-label refinement (PuRe) framework that begins training with zero-shot CLIP pseudo-labels and then transitions to self-training, in which the model’s own predictions are iteratively refined and reused as pseudo-labels. In this iterative process, we introduce label consistency learning, which stabilizes predictions under strong data augmentations to improve robustness, and an information maximization (IM) loss, which encourages high-confidence predictions while preserving prediction diversity. Together, these components progressively enhance pseudo-label reliability, leading to improved adaptation in the target domain. Furthermore, we propose a novel prompt learning (PL) method that fine-tunes only a minimal set of parameters, specifically a single context token and the class embeddings. Extensive experiments demonstrate that PuRe surpasses existing UDA methods on multiple benchmarks.
External IDs: doi:10.1109/tsmc.2026.3660284
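The two target-domain objectives named in the abstract, label consistency under strong augmentation and information maximization, can be illustrated with a short sketch. The code below is not the authors' implementation; it assumes a generic PyTorch classifier that outputs logits, and the function names, confidence threshold, and equal weighting of the entropy and diversity terms are illustrative assumptions only.

```python
import torch
import torch.nn.functional as F


def information_maximization_loss(logits: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """IM loss sketch: push each prediction toward high confidence (low per-sample
    entropy) while keeping the batch-level marginal prediction diverse."""
    probs = F.softmax(logits, dim=1)                                   # (B, C) class probabilities
    sample_entropy = -(probs * torch.log(probs + eps)).sum(1).mean()   # minimize: confident predictions
    marginal = probs.mean(0)                                           # (C,) batch marginal distribution
    marginal_entropy = -(marginal * torch.log(marginal + eps)).sum()   # maximize: preserve diversity
    return sample_entropy - marginal_entropy


def label_consistency_loss(weak_logits: torch.Tensor,
                           strong_logits: torch.Tensor,
                           threshold: float = 0.95) -> torch.Tensor:
    """Consistency sketch: confident pseudo-labels from a weakly augmented view
    supervise the prediction on a strongly augmented view of the same image."""
    with torch.no_grad():
        weak_probs = F.softmax(weak_logits, dim=1)
        confidence, pseudo_labels = weak_probs.max(dim=1)              # hard pseudo-labels + confidence
        mask = (confidence >= threshold).float()                       # keep only confident samples
    per_sample = F.cross_entropy(strong_logits, pseudo_labels, reduction="none")
    return (mask * per_sample).sum() / mask.sum().clamp(min=1.0)
```

In the framework described by the abstract, such terms would be combined with a classification loss on the refined pseudo-labels; the exact loss weighting is not specified in the abstract.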