Segmentation as a Plug-and-Play Capability for Frozen Multimodal LLMs

05 Sept 2025 (modified: 21 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Reasoning Segmentation, Computer Vision, MLLM
Abstract: Integrating diverse visual capabilities into a unified model is a significant trend in Multimodal Large Language Models (MLLMs). Among these capabilities, segmentation poses a distinct set of challenges. To equip MLLMs with pixel-level segmentation abilities, prevailing methods require finetuning the model to produce specific outputs compatible with a mask decoder. This process typically alters the model’s output space and compromises its intrinsic generalization, which undermines the goal of building a unified model. We introduce *LENS* (**L**everaging k**E**ypoi**N**ts for MLLMs' **S**egmentation), a novel plug-and-play solution. *LENS* attaches a lightweight, trainable head to a completely frozen MLLM. By refining the spatial cues embedded in attention maps, *LENS* extracts keypoints and encodes them into point-wise features directly compatible with the mask decoder. Extensive experiments validate our approach: *LENS* achieves segmentation performance competitive with or superior to that of retraining-based methods. Crucially, it does so while fully preserving the MLLM's generalization capabilities, which are significantly degraded by finetuning approaches. As such, the attachable design of *LENS* establishes an efficient and powerful paradigm for extending MLLMs, paving the way for truly multi-talented, unified models.
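To make the plug-and-play design described in the abstract concrete, the sketch below shows one plausible way a lightweight keypoint head could sit on top of a frozen MLLM: it refines stacked attention maps into a heatmap, picks top-k keypoint locations, and projects the frozen vision features at those locations into point-wise prompt embeddings for a SAM-style mask decoder. All class names, tensor shapes, layer sizes, and the top-k selection are illustrative assumptions for this sketch, not the paper's released implementation.

```python
import torch
import torch.nn as nn


class KeypointHead(nn.Module):
    """Hypothetical lightweight head attached to a frozen MLLM (illustration only).

    attn_maps:   (B, num_heads, H, W) attention maps taken from the frozen MLLM
    image_feats: (B, C, H, W) frozen vision features on the same spatial grid
    Output:      (B, K, prompt_dim) point-wise embeddings for a mask decoder
    """

    def __init__(self, num_heads: int, feat_dim: int, prompt_dim: int = 256,
                 hidden_dim: int = 256, num_keypoints: int = 8):
        super().__init__()
        # Small convolutional stack that refines the stacked attention maps
        # into a single keypoint heatmap.
        self.refine = nn.Sequential(
            nn.Conv2d(num_heads, hidden_dim, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv2d(hidden_dim, 1, kernel_size=1),
        )
        self.num_keypoints = num_keypoints
        # Projects each keypoint's pooled vision feature into the decoder's
        # prompt-embedding space.
        self.to_prompt = nn.Linear(feat_dim, prompt_dim)

    def forward(self, attn_maps: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        heat = self.refine(attn_maps)                       # (B, 1, H, W)
        flat = heat.flatten(2).squeeze(1)                   # (B, H*W)
        idx = flat.topk(self.num_keypoints, dim=-1).indices # (B, K) keypoint indices

        feats = image_feats.flatten(2)                      # (B, C, H*W)
        idx_exp = idx.unsqueeze(1).expand(-1, feats.size(1), -1)
        kp_feats = feats.gather(2, idx_exp).transpose(1, 2) # (B, K, C)
        return self.to_prompt(kp_feats)                     # (B, K, prompt_dim)


# Usage sketch: only this head would be trained; the MLLM and mask decoder stay frozen.
head = KeypointHead(num_heads=32, feat_dim=1024)
attn_maps = torch.randn(1, 32, 24, 24)
image_feats = torch.randn(1, 1024, 24, 24)
point_prompts = head(attn_maps, image_feats)                # fed to the mask decoder
```

Because only this small module receives gradients, the frozen MLLM's output space and its general multimodal behavior are left untouched, which is the property the abstract emphasizes.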
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 2277