LARE: Low-Attention Region Encoding for Text–Image Retrieval

Muhammad Kamran J Khan; Abdulmalik Alquwayfili; Faisal AlMeshal; Jumanah Almajnouni; Leena Alotaibi; Huda Abdulhadi Alamri; Raied Aljadaany; Faisal alhajari; Mohammed Alkhrashi; Alreem Almuhrij; Abdullah Aldwyish

Published: 27 May 2026, Last Modified: 01 Jun 2026; FMEA @ CVPR 2026 Poster; CC BY 4.0

Keywords: Image Retrieval, Low Attention, Crowded Scenes

TL;DR: LARE (Low-Attention Region Encoding) encodes low-attention regions and full images in parallel, generating diverse, informative embeddings that enhance text–image retrieval performance.

Abstract: Image retrieval in crowded scenes is particularly challenging because conventional visual encoders exhibit a salience bias: they focus on dominant objects and neglect low-attention regions that are often crucial for fine-grained retrieval. We propose LARE (Low-Attention Region Encoding), a framework that explicitly models these overlooked regions. LARE adopts a dual-encoding strategy that encodes the low-attention regions of an image and the full image in parallel, yielding more diverse and informative image embeddings. To evaluate retrieval in crowded scenes, we introduce Dense-Set, a challenging subset derived from COCO and Flickr30K in which images are re-captioned to describe low-attention or previously overlooked regions in greater detail. Dense-Set highlights the limitations of existing retrieval models and enables a more rigorous evaluation under densely crowded scene conditions. Experimental results demonstrate that the proposed framework improves retrieval performance by preserving subtle, non-dominant visual cues within the shared latent space.
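
The dual-encoding idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify how low-attention regions are selected, encoded, or fused. The sketch assumes per-patch features and a per-patch attention score are already available, picks the `k` least-attended patches, mean-pools them into a second embedding next to the full-image embedding, and scores a text query by the maximum cosine similarity over the two. The names `dual_embed`, `score`, and `k`, the mean pooling, and the max fusion are all hypothetical choices made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def dual_embed(patch_feats, attn, k):
    """Build a full-image and a low-attention embedding (illustrative only).

    patch_feats: (P, D) array of patch features (assumed given).
    attn:        (P,) array of attention scores per patch (assumed given).
    k:           number of least-attended patches to pool (hypothetical knob).
    """
    low_idx = np.argsort(attn)[:k]                      # k least-attended patches
    full = l2_normalize(patch_feats.mean(axis=0))       # pooled over all patches
    low = l2_normalize(patch_feats[low_idx].mean(axis=0))  # pooled over low-attention patches
    return np.stack([full, low]), low_idx               # embeddings have shape (2, D)


def score(text_emb, image_embs):
    """Max cosine similarity between a text query and the image embeddings.

    Max fusion is an assumption; the source does not state a fusion rule.
    """
    return float(np.max(image_embs @ l2_normalize(text_emb)))


# Toy image: two salient patches along one feature axis, two overlooked ones along another.
patch_feats = np.array([[1.0, 0.0, 0.0],
                        [1.0, 0.0, 0.0],
                        [0.0, 1.0, 0.0],
                        [0.0, 1.0, 0.0]])
attn = np.array([0.6, 0.25, 0.1, 0.05])
embs, low_idx = dual_embed(patch_feats, attn, k=2)

# A query describing the overlooked content matches the low-attention embedding best.
query = np.array([0.0, 1.0, 0.0])
full_only = score(query, embs[:1])
dual = score(query, embs)
```

In this toy case the query aligns with the overlooked patches, so `dual` is higher than `full_only`; that is the effect the abstract attributes to preserving non-dominant cues, shown here only under the stated assumptions.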

Submission Number: 46
