DCRM-ViT: Domain Conditioned Residual Modulation for Multi-Domain Vision Transformers

ICLR 2026 Conference Submission13793 Authors

18 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Domain Adaptation, Vision Transformers, Meta Learning
TL;DR: DCRM-ViT is a lightweight add-on to frozen Vision Transformers that uses domain-conditioned residual modulation and bi-level optimization to adapt to medical imaging artifacts while preserving general-image performance with minimal compute.
Abstract: Medical imaging presents significant challenges due to acoustic shadows, motion blur, and indistinct boundaries. Addressing these issues is crucial for improving diagnostic accuracy. Many conventional vision models require extensive fine-tuning on task-specific data and often lose generalizability to natural-image domains. We propose DCRM-ViT, a domain-conditioned residual modulation framework for Vision Transformers that preserves general-vision capability while adapting to diverse domains. DCRM-ViT keeps the backbone frozen and augments each block with a lightweight Residual Modulation Block (RMB) whose parameters are synthesized per sample by a Domain Router (DR) and a Parameter Synthesizer Network (PSN). The router outputs soft domain weights from input features, and the synthesizer maps these weights to low-rank residuals that modulate selected projections and, optionally, add a domain-aware bias to attention. Crucially, we learn routing and modulation via a bi-level optimization scheme: a short inner loop adapts RMB parameters to task supervision, while an outer loop updates the DR, the PSN, and the RMB initializations and step sizes so that the synthesized residuals generalize across medical and natural domains. Across fine-grained classification (Food101, SUN397, Stanford Cars) and medical segmentation (ultrasound, CT, MRI), DCRM-ViT improves over strong baselines while using modest trainable compute. Ablation studies confirm the benefits of our architectural enhancements, showing improved performance and adaptability. The results demonstrate DCRM-ViT's potential to deliver high diagnostic performance with low computational overhead (0.3 training minutes per epoch). Our code will be publicly available upon acceptance.
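The abstract's core mechanism (a frozen projection modulated by a soft-routed mixture of low-rank residuals) can be sketched numerically. This is an illustrative reconstruction from the abstract alone, not the authors' implementation: the pooling choice, the per-domain factor parameterization `A`/`B`, and all dimensions are assumptions, and the inner/outer bi-level training loop is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Illustrative dimensions (assumed, not from the paper)
d = 64         # hidden size of the frozen ViT projection
r = 4          # rank of each domain's residual
n_domains = 3  # number of soft domains the router distinguishes

# Frozen backbone projection (stands in for one ViT block's linear layer)
W_frozen = rng.standard_normal((d, d)) / np.sqrt(d)

# Domain Router (DR): pooled features -> soft domain weights
W_router = rng.standard_normal((d, n_domains)) / np.sqrt(d)

# Parameter Synthesizer Network (PSN), here simplified to per-domain
# low-rank factors (A_k, B_k) mixed by the router's weights
A = rng.standard_normal((n_domains, d, r)) / np.sqrt(d)
B = rng.standard_normal((n_domains, r, d)) / np.sqrt(r)

def dcrm_forward(x):
    """Domain-conditioned residual modulation of one frozen projection.

    x: (tokens, d) token features entering the block.
    Returns the modulated projection output, shape (tokens, d).
    """
    pooled = x.mean(axis=0)                      # sample-level feature summary
    w = softmax(pooled @ W_router)               # soft domain weights, (n_domains,)
    # Synthesize the residual as a router-weighted mixture of low-rank terms
    delta = np.einsum("k,kdr,kre->de", w, A, B)  # (d, d), rank <= r per domain
    return x @ (W_frozen + delta)                # frozen weight + residual

x = rng.standard_normal((16, d))
y = dcrm_forward(x)
print(y.shape)  # (16, 64)
```

Because the backbone weight `W_frozen` is never updated, only the small router and synthesizer parameters are trainable, which is consistent with the paper's claim of modest trainable compute.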
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 13793