VIT-MOQ: REVISITING MOMENTUM QUEUES FOR RESOURCE-EFFICIENT VISION TRANSFORMERS AND DOMAIN GENERALIZATION IN SELF-SUPERVISED LEARNING

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Self-supervised learning, Domain Generalization, Compute Efficiency, Vision Transformer, MoCo, Deep metric learning
TL;DR: We reintroduce momentum queues into SSL with a ViT backbone for efficient computation and better domain generalization.
Abstract: Self-supervised learning (SSL) has achieved remarkable success in computer vision, but current state-of-the-art methods require substantial computational resources, with large batch sizes (e.g., 4096) and multi-GPU clusters, limiting accessibility for many researchers. We present ViT-MoQ, a compute-efficient contrastive SSL method that reintroduces momentum queues to Vision Transformer architectures. Our key insight is that symmetric encoder architectures are essential for queue-based learning in ViTs, contrary to the asymmetric designs prevalent in recent SSL methods. ViT-MoQ achieves competitive performance while requiring only a single consumer GPU: on ImageNet-1K linear probing, it reaches performance comparable to the state of the art with as few as 240 GPU hours. More importantly, we demonstrate superior domain generalization: when trained on DomainNet Real, ViT-MoQ significantly outperforms MoCo variants across all tested domains (e.g., 44.42% vs. 28.4% on painting, 44.81% vs. 0.6% on quickdraw). Our work challenges the assumption that momentum queues are obsolete in the transformer era and demonstrates that architectural compatibility, not any inherent limitation, was the barrier to their adoption. ViT-MoQ democratizes SSL research by making high-quality self-supervised learning accessible on modest hardware, while learning more transferable, domain-agnostic representations and enabling sustainable, green AI research practices. Code will be published.
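The momentum-queue recipe the abstract revisits follows the generic MoCo-style pattern: a slowly updated key encoder (an exponential moving average of the query encoder) fills a FIFO queue of negative features, against which an InfoNCE loss is computed. The sketch below is a minimal NumPy illustration of that generic mechanism, not the authors' released code; all names (`FeatureQueue`, `momentum_update`, `info_nce_logits`) and the hyperparameters are illustrative assumptions.

```python
import numpy as np

def momentum_update(q_params, k_params, m=0.999):
    # EMA update: the key encoder slowly tracks the query encoder.
    # (Illustrative hyperparameter; MoCo uses m = 0.999.)
    return {name: m * k_params[name] + (1.0 - m) * q_params[name]
            for name in k_params}

class FeatureQueue:
    """FIFO queue of L2-normalized key features used as negatives."""
    def __init__(self, dim, size):
        feats = np.random.randn(size, dim)
        self.feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
        self.ptr = 0
        self.size = size

    def enqueue(self, keys):
        # Overwrite the oldest entries with the newest batch of keys.
        n = keys.shape[0]
        idx = (self.ptr + np.arange(n)) % self.size
        self.feats[idx] = keys
        self.ptr = (self.ptr + n) % self.size

def info_nce_logits(q, k, queue_feats, tau=0.2):
    # Row i: positive similarity q_i . k_i, then similarities to all
    # queued negatives; the InfoNCE target label is 0 for every row.
    l_pos = np.sum(q * k, axis=1, keepdims=True)         # (B, 1)
    l_neg = q @ queue_feats.T                            # (B, K)
    return np.concatenate([l_pos, l_neg], axis=1) / tau  # (B, 1 + K)
```

Because negatives come from the queue rather than the current batch, the effective number of negatives is decoupled from the batch size, which is what allows small-batch, single-GPU training.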
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 10990