Local-Level Feature Aggregation with Attribute Anchors for Text-Guided Image Retrieval

Published: 2025 · Last Modified: 08 Jan 2026 · ICAIIC 2025 · CC BY-SA 4.0
Abstract: Text-guided image retrieval (TGIR) aims to retrieve appropriate target images based on textual user feedback on a reference image. Existing methods employ global-level representations to model changes in the query by combining global feature vectors from the reference image and the feedback text. However, these methods struggle to capture the local image changes indicated by attribute words in the feedback text, as they do not explicitly address these local changes during the query combination process. To address this limitation, we propose a novel local-level feature aggregation (LFA) module and a training strategy accompanied by a newly defined loss function. In the LFA module, we introduce a set of trainable attribute anchors that aggregate local features of the image and text in the semantic space. These aggregated local features effectively represent local changes in the query and target images from the perspective of multiple attribute anchors. In addition, the LFA module can be easily integrated with existing global-level feature representation modules, which play a complementary role in image retrieval. We validate the effectiveness of our proposed method on two benchmark datasets, achieving considerable performance improvements.
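To make the anchor-based aggregation idea concrete, the following is a minimal NumPy sketch of how a set of attribute anchors could attend over local features and produce one aggregated feature per anchor. It is an illustration only: the function name, shapes, and scaled dot-product attention form are assumptions, and in the actual model the anchors would be trainable parameters inside a deep network rather than fixed random vectors.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def aggregate_with_anchors(local_feats, anchors):
    """Aggregate local features from the viewpoint of each attribute anchor.

    local_feats: (N, d) local image or text features
    anchors:     (K, d) attribute anchors (trainable in the real model)
    returns:     (K, d) one aggregated local feature per anchor
    """
    d = anchors.shape[1]
    # Each anchor attends over the N local features (scaled dot product)
    attn = softmax(anchors @ local_feats.T / np.sqrt(d), axis=-1)  # (K, N)
    # Attention-weighted sum of local features per anchor
    return attn @ local_feats  # (K, d)

rng = np.random.default_rng(0)
local = rng.normal(size=(49, 64))    # e.g. a 7x7 grid of image patch features
anchors = rng.normal(size=(8, 64))   # 8 hypothetical attribute anchors
out = aggregate_with_anchors(local, anchors)
print(out.shape)  # (8, 64)
```

Because each anchor produces its own attention distribution, the output represents the same set of local features "from the perspective of" K different attributes, which is the intuition behind the aggregated representation described in the abstract.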