Focusing on Representation of Multi-head Attention for Open-Vocabulary Semantic Segmentation

Published: 01 Jan 2025, Last Modified: 14 May 2025 · ICEIC 2025 · CC BY-SA 4.0
Abstract: With the advancement of large vision-language models, impressive results have been achieved on open-vocabulary tasks that require strong generalization ability. However, in pixel-level dense prediction tasks, recognizing the local information of objects remains challenging, because the query-key multiplication in self-attention is not well suited to dense prediction. Instead, significant performance improvements have been obtained by using self-self attention over queries, keys, and values (q-q, k-k, v-v). However, research on how the individual heads of multi-head attention affect dense prediction remains scarce. In this paper, we decompose multi-head attention (MHA) in self-attention and investigate how each attention head affects the performance of open-vocabulary semantic segmentation. Each attention head captures a different representation, such as number, shape, or color. Through our study, we find that certain attention heads of CLIP are better suited to dense prediction tasks. Finally, by selecting specific attention heads in the last layer of CLIP's visual encoder, we achieve significant performance improvements over the current state-of-the-art model, SCLIP, on three benchmark datasets: VOC21, Context60, and COCO-Object.
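As a rough illustration of the idea described in the abstract, the sketch below computes self-self attention (q-q, k-k, or v-v) head by head and keeps only a chosen subset of heads, as one might do in the last layer of CLIP's visual encoder. The function name, the packed qkv weight layout, and the head indices in the usage example are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F


def head_selective_self_self_attention(x, w_qkv, b_qkv, num_heads,
                                        selected_heads, mode="qq"):
    """Minimal sketch of head-wise self-self attention with head selection.

    x: (N, D) patch-token features from the last visual-encoder layer.
    w_qkv, b_qkv: packed q/k/v projection weight (3D, D) and bias (3D,).
    selected_heads: indices of the heads to keep (hypothetical choice here;
        which heads work best depends on the analysis in the paper).
    mode: "qq", "kk", or "vv" -- which self-self similarity to use.
    """
    N, D = x.shape
    head_dim = D // num_heads
    scale = head_dim ** -0.5

    # Project once, then split into q, k, v, each of shape (N, D).
    q, k, v = (x @ w_qkv.t() + b_qkv).chunk(3, dim=-1)

    # Reshape to (num_heads, N, head_dim) so each head is handled separately.
    q = q.view(N, num_heads, head_dim).transpose(0, 1)
    k = k.view(N, num_heads, head_dim).transpose(0, 1)
    v = v.view(N, num_heads, head_dim).transpose(0, 1)

    # Self-self similarity (q-q, k-k, or v-v) instead of the usual q-k.
    a = {"qq": q, "kk": k, "vv": v}[mode]
    attn = F.softmax(a @ a.transpose(-2, -1) * scale, dim=-1)  # (H, N, N)

    out = attn @ v  # (num_heads, N, head_dim)

    # Zero out the heads that were not selected so only chosen heads contribute.
    mask = torch.zeros(num_heads, 1, 1, dtype=x.dtype, device=x.device)
    mask[list(selected_heads)] = 1.0
    out = out * mask

    return out.transpose(0, 1).reshape(N, D)  # back to (N, D)


# Hypothetical usage with ViT-B/16-like dimensions (196 patches, D=768, 12 heads).
D, H = 768, 12
x = torch.randn(196, D)
w_qkv, b_qkv = torch.randn(3 * D, D), torch.zeros(3 * D)
feats = head_selective_self_self_attention(x, w_qkv, b_qkv, H,
                                            selected_heads=[0, 3, 7], mode="qq")
```

Zeroing the unselected heads (rather than dropping them) keeps the output dimensionality compatible with the subsequent output projection; this is one of several reasonable ways to realize head selection under the assumptions above.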