Segmentation From Attention: Training-Free Layer Selection and One-Shot Tuning for Segmentation in VLMs
Abstract: Large-scale vision-language models (VLMs), trained on extensive datasets of image-text pairs, exhibit strong multimodal understanding capabilities by implicitly learning associations between textual descriptions and image regions. This emergent ability enables zero-shot object detection and segmentation via techniques that rely on text-image attention maps, without training on large labeled segmentation datasets. However, the performance of such methods depends heavily on prompt engineering and on manually chosen attention layers or heads. In this work, we propose a training-free entropy-based measure, InfoScore, to identify the image-text attention layers best suited for segmentation, providing a more flexible and scalable solution for training-free open-vocabulary segmentation and reducing the burden of hyperparameter search. We empirically show that our training-free selection strategy is superior to naive selection strategies. Additionally, we demonstrate that instead of relying solely on text prompts, fine-tuning the image-text attention layer with a single visual example of each class significantly improves segmentation without the need for additional parameters or decoders. Moreover, we show that our methods and findings are general and can be applied across various VLMs. Our code will be released upon acceptance.
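To make the layer-selection idea in the abstract concrete, below is a minimal sketch of scoring attention layers by spatial entropy. Everything here is an illustrative assumption: the attention-tensor shape, the low-entropy-is-better heuristic, and the names `info_score` and `select_layers` are not the paper's exact formulation (Algorithm 1 in the revised manuscript gives the precise InfoScore computation).

```python
import torch

def info_score(attn, eps=1e-8):
    """Hypothetical entropy-based layer score (not the paper's exact InfoScore).

    `attn` holds text-to-image attention weights for one layer, assumed to
    have shape [heads, text_tokens, image_patches]. The intuition sketched
    here: layers whose text tokens attend to a concentrated set of image
    patches (low spatial entropy) localize objects better.
    """
    # Average over heads, then renormalize each text token's map over patches.
    p = attn.mean(dim=0)
    p = p / (p.sum(dim=-1, keepdim=True) + eps)
    # Spatial entropy per text token, averaged over tokens.
    entropy = -(p * (p + eps).log()).sum(dim=-1).mean()
    # Lower entropy -> more concentrated attention -> higher score.
    return -entropy.item()

def select_layers(per_layer_attn, k=2):
    """Rank layers by the score above and keep the top-k for segmentation."""
    scores = [info_score(a) for a in per_layer_attn]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
```

Under these assumptions, `select_layers` replaces a manual per-model search over layers and heads with a single forward pass that ranks layers automatically.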
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission:
We have revised the manuscript to address the reviewers’ comments and suggestions. All changes are highlighted in **blue** in the revised manuscript. The major revisions are summarized below, organized by reviewer.
---
### Reviewer guZd
- **Limitations discussion**: Added a new *Limitations* section (Section 5) that explicitly discusses why CLIP-based models are out of scope for the proposed approach and clarifies the architectural assumptions underlying our method.
---
### Reviewer ZrbZ
- **Layer-wise InfoScore analysis on ALBEF**: Added a layer-wise InfoScore analysis for ALBEF in Figure 8 (Appendix), with accompanying discussion in Section A.2.
- **Additional qualitative results on ADE-20K**: Included additional qualitative results on ADE-20K in Figure 11 (Appendix), with detailed discussion in Section A.4.2.
- **Comparison with 1-shot segmentation methods**: Added a comparison with state-of-the-art 1-shot segmentation approaches in Table 2 and discussed the results in Section 4.2.2.
- **Token aggregation clarification**: Clarified the rationale for using mean aggregation over tokens for each class in Section 3.1.
---
### Reviewer tJRW
- **Terminology correction**: Corrected the terminology throughout the paper by referring to InfoScore as a *measure* rather than a *metric*.
- **Expanded Figure 1 and additional baselines**: Updated Figure 1 to include 1-shot fine-tuning performance when (i) fine-tuning all layers and (ii) randomly selecting the top-2 layers instead of using InfoScore. Corresponding random layer selection results in the 1-shot setting were also added to Table 10.
- **Methodology clarification and formalization**: Revised Section 3.1 to use more formal notation and clearer structure. Added two pseudocode descriptions:
  - Algorithm 1: InfoScore computation
  - Algorithm 2: Training-free inference
- **Design desiderata for InfoScore**: Explicitly outlined the design desiderata of the InfoScore measure in Section 3.2 to better motivate its formulation.
- **1-shot comparison with SOTA methods**: Added a comparison with state-of-the-art 1-shot segmentation approaches in Table 2 and discussed the results in Section 4.2.2.
- **Minor fixes**: Fixed citation issues and incorporated minor editorial changes suggested by the reviewer.
Assigned Action Editor: ~Mathieu_Salzmann1
Submission Number: 6028