Mask4Align: Aligned Entity Prompting with Color Masks for Multi-Entity Localization Problems

Haoquan Zhang, Ronggang Huang, Yi Xie, Huaidong Zhang

Published: 2024, Last Modified: 13 May 2025, CVPR 2024, CC BY-SA 4.0

Abstract: In Visual Question Answering (VQA), recognizing and localizing entities pose significant challenges. Pre-trained vision-and-language models have addressed this problem by providing a text description as the answer. However, in visual scenes with multiple entities, textual descriptions struggle to distinguish entities of the same category. Consequently, VQA datasets are constrained by what text descriptions can express and cannot adequately cover scenarios involving multiple entities. To address this challenge, we introduce Mask for Align (Mask4Align), a method that determines the position of the entity in a given image that best matches the user's question. The method incorporates colored masks into the image, enabling the VQA model to handle the discrimination and localization challenges associated with multiple entities. To process an arbitrary number of similar entities, Mask4Align is designed hierarchically to discern subtle differences and achieve precise localization. Because Mask4Align directly uses pre-trained models, it introduces no additional training overhead. Extensive experiments on both a gaze target prediction dataset and our proposed multi-entity localization dataset demonstrate the superiority of Mask4Align. Code and dataset are available at https://github.com/haoquanzhang/mask4align.
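
The abstract's two ideas, coloring candidate entities and narrowing them down hierarchically, might be sketched roughly as follows. This is only an illustrative reading of the abstract, not the authors' implementation (see the repository for that). The palette, the alpha-blending step, the tournament-style rounds, and the names `overlay_masks`, `localize`, and `ask` are all assumptions introduced here. `ask` stands in for a pre-trained VQA model that receives the painted image, the question, and the color names in use, and returns one of those names.

```python
from typing import Callable, List, Sequence, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]   # rows of RGB pixels
Mask = List[List[bool]]     # True where the entity is

# Hypothetical palette: the abstract does not say which colors are used.
PALETTE = {"red": (255, 0, 0), "green": (0, 255, 0), "blue": (0, 0, 255)}


def overlay_masks(image: Image, masks: Sequence[Mask],
                  colors: Sequence[Pixel], alpha: float = 0.5) -> Image:
    """Alpha-blend one color per entity mask onto a copy of the image."""
    out = [list(row) for row in image]  # copy rows so the input is untouched
    for mask, color in zip(masks, colors):
        for y, row in enumerate(mask):
            for x, inside in enumerate(row):
                if inside:
                    out[y][x] = tuple(
                        round((1 - alpha) * p + alpha * c)
                        for p, c in zip(out[y][x], color)
                    )
    return out


def localize(image: Image, masks: Sequence[Mask],
             ask: Callable[[Image, str, List[str]], str],
             question: str) -> int:
    """Return the index of the mask that `ask` picks after repeated rounds.

    Assumed scheme: each round splits the remaining candidates into groups
    no larger than the palette, paints each group, asks the model which
    color matches the question, and keeps one winner per group.
    """
    if not masks:
        raise ValueError("at least one candidate mask is required")
    names = list(PALETTE)
    candidates = list(range(len(masks)))
    while len(candidates) > 1:
        winners = []
        for start in range(0, len(candidates), len(names)):
            group = candidates[start:start + len(names)]
            if len(group) == 1:
                winners.append(group[0])  # nothing to compare against yet
                continue
            used = names[:len(group)]
            painted = overlay_masks(
                image,
                [masks[i] for i in group],
                [PALETTE[name] for name in used],
            )
            answer = ask(painted, question, used)  # expected to be in `used`
            winners.append(group[used.index(answer)])
        candidates = winners
    return candidates[0]
```

Because each round only issues queries to an existing model, a scheme like this would need no training, which is consistent with the abstract's claim of no additional training overhead. How the actual method groups entities and phrases its prompts is not specified in the abstract.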