M$^2$oT: Agglomerative Vision Foundation Models via Sparse Mixture-of-Experts

ICLR 2026 Conference Submission 16851 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Multi-teacher Distillation, Mixture-of-Experts, Vision Foundation Model
TL;DR: We introduce a multi-teacher distillation method that uses a Sparse Mixture-of-Experts architecture with a specialization-oriented loss to resolve the compromised trap in building agglomerative vision foundation models.
Abstract: Agglomerative models aim to unify the strengths of diverse vision foundation models through multi-teacher distillation, improving performance across a wide range of tasks. However, current feature-aligned distillation approaches frequently fall into a compromised trap: the student learns compromised features that overlook the unique contributions and inherent differences of individual teachers, limiting overall model performance. To address this limitation, we propose M$^2$oT, a Sparse Mixture-of-Experts (SMoE) based framework for Multi-Teacher distillation. Within M$^2$oT, we introduce a teacher-aware loss as a regularization term that actively increases expert diversity, enabling the SMoE to capture specialized features tailored to each teacher's unique contributions. Extensive experiments across large-scale vision tasks demonstrate the superior performance of our method and validate its effectiveness in resolving the compromised trap.
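To make the abstract's setup concrete, below is a minimal PyTorch sketch of one plausible instantiation: a top-k gated SMoE student head, per-teacher projection heads for feature-aligned distillation, and a cosine-based expert-diversity regularizer standing in for the paper's teacher-aware loss. All names and values here (`SparseMoE`, `expert_diversity_loss`, the expert/teacher counts, the 0.1 weight) are illustrative assumptions, not the submission's actual formulation.

```python
# Minimal sketch (assumptions: top-k gated SMoE, feature-aligned distillation
# against frozen teacher features, cosine-similarity diversity regularizer).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoE(nn.Module):
    """Top-k gated mixture of feed-forward experts over token features."""

    def __init__(self, dim: int, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim * 2), nn.GELU(), nn.Linear(dim * 2, dim))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor):
        # x: (batch, tokens, dim). For clarity every expert is applied densely;
        # a real SMoE would dispatch only the tokens routed to each expert.
        logits = self.gate(x)                                    # (B, T, E)
        weights, idx = logits.topk(self.top_k, dim=-1)           # route each token to k experts
        weights = weights.softmax(dim=-1)
        expert_outs = torch.stack([e(x) for e in self.experts])  # (E, B, T, D)
        out = torch.zeros_like(x)
        for e in range(len(self.experts)):
            w = (weights * (idx == e)).sum(dim=-1, keepdim=True)  # gate weight for expert e
            out = out + w * expert_outs[e]
        return out, expert_outs


def multi_teacher_distill_loss(student_feats, teacher_feats, heads):
    """Feature-aligned distillation: one projection head per (frozen) teacher,
    cosine-distance alignment of projected student features to each teacher."""
    losses = []
    for head, t_feat in zip(heads, teacher_feats):
        s_proj = head(student_feats)
        losses.append(1.0 - F.cosine_similarity(s_proj, t_feat, dim=-1).mean())
    return sum(losses) / len(losses)


def expert_diversity_loss(expert_outs: torch.Tensor) -> torch.Tensor:
    """Hypothetical specialization term: penalize pairwise cosine similarity
    between mean expert outputs so experts are pushed toward distinct features."""
    z = F.normalize(expert_outs.mean(dim=(1, 2)), dim=-1)   # (E, D)
    sim = z @ z.t()                                          # (E, E)
    off_diag = ~torch.eye(len(z), dtype=torch.bool, device=z.device)
    return sim[off_diag].mean()


if __name__ == "__main__":
    B, T, D, num_teachers = 2, 16, 256, 3
    student_feats = torch.randn(B, T, D)
    teacher_feats = [torch.randn(B, T, D) for _ in range(num_teachers)]  # stand-ins for frozen teachers

    moe = SparseMoE(dim=D, num_experts=4, top_k=2)
    heads = nn.ModuleList(nn.Linear(D, D) for _ in range(num_teachers))

    fused, expert_outs = moe(student_feats)
    distill = multi_teacher_distill_loss(fused, teacher_feats, heads)
    diversity = expert_diversity_loss(expert_outs)
    total = distill + 0.1 * diversity   # 0.1 is an arbitrary placeholder weight
    total.backward()
    print(float(distill), float(diversity))
```

The diversity term plays the role the abstract assigns to the teacher-aware loss: it discourages experts from collapsing onto the same averaged representation, so different experts can align with different teachers; the paper's actual loss may differ in form and in how it uses teacher identity.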
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 16851