Keywords: Scene Understanding; Mixture-of-Experts; Multi-task Learning
Abstract: Multi-task dense scene understanding requires models to jointly reason over heterogeneous visual cues. While foundation vision models like SAM 2 provide strong general-purpose features, their extension to multi-task settings is limited by task interference and the lack of explicit task-aware routing mechanisms. In this paper, we present LangSAM, a novel language-guided mixture-of-experts framework built on top of SAM 2 for dense scene understanding. Our key idea is to leverage natural language task prompts to guide expert activation, thereby enabling more effective task-aware feature representations. Specifically, we encode each task prompt and design a text-guided router that fuses the global visual representation with the task embedding to produce task-aware gating signals.
These signals are combined with a token-level MoE gate, yielding a dual-gated mechanism that enables experts to specialize both spatially and semantically. To further enhance representation learning, LangSAM incorporates task-specific language-guided MoE blocks for coarse predictions and a shared language-guided MoE block that refines multi-task features by modeling global dependencies.
We evaluate LangSAM on two standard datasets, NYUD-v2 and PASCAL-Context, covering six dense prediction tasks: semantic segmentation, depth estimation, human part segmentation, saliency estimation, surface normal estimation, and boundary detection. Extensive experiments show that LangSAM consistently improves over strong SAM 2 baselines and recent multi-task learning methods, highlighting the effectiveness of language-guided expert routing as a new paradigm for multi-task dense prediction. The code will be released.
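Illustration (not part of the submission): the abstract's dual-gated routing, where a text-guided task gate is combined with a token-level MoE gate, can be sketched in a few lines of PyTorch. This is a minimal sketch under assumed shapes and module names (LanguageGuidedMoE, text_router, token_gate are hypothetical, not the authors' implementation), intended only to make the mechanism concrete.

# Minimal PyTorch sketch of a dual-gated, language-guided MoE layer.
# All names and shapes are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageGuidedMoE(nn.Module):
    def __init__(self, dim: int, text_dim: int, num_experts: int = 4):
        super().__init__()
        # Experts: simple per-token MLPs over visual features.
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
             for _ in range(num_experts)]
        )
        # Token-level gate: routes each visual token over the experts (spatial specialization).
        self.token_gate = nn.Linear(dim, num_experts)
        # Text-guided router: fuses a global visual descriptor with the task-prompt
        # embedding to produce one task-aware gating vector (semantic specialization).
        self.text_router = nn.Linear(dim + text_dim, num_experts)

    def forward(self, tokens: torch.Tensor, task_emb: torch.Tensor) -> torch.Tensor:
        # tokens:   (B, N, dim) visual tokens, e.g. from a SAM 2 backbone
        # task_emb: (B, text_dim) encoded natural-language task prompt
        global_feat = tokens.mean(dim=1)                                   # (B, dim)
        task_gate = F.softmax(
            self.text_router(torch.cat([global_feat, task_emb], dim=-1)), dim=-1
        )                                                                  # (B, E)
        token_gate = F.softmax(self.token_gate(tokens), dim=-1)            # (B, N, E)
        # Dual gating: combine task-aware (semantic) and token-level (spatial) signals.
        gate = token_gate * task_gate.unsqueeze(1)                         # (B, N, E)
        gate = gate / gate.sum(dim=-1, keepdim=True).clamp_min(1e-6)
        expert_out = torch.stack([e(tokens) for e in self.experts], dim=-1)  # (B, N, dim, E)
        return (expert_out * gate.unsqueeze(2)).sum(dim=-1)                # (B, N, dim)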
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 4111