Keywords: Image Retrieval, Low Attention, Crowded Scenes
TL;DR: LARE (Low-Attention Region Encoding) encodes low-attention regions and full images in parallel, generating diverse, informative embeddings that enhance text–image retrieval performance.
Abstract: Image retrieval in crowded scenes is particularly challenging due to the salience bias of conventional visual encoders, which tend to focus on dominant objects while neglecting low-attention regions that are often crucial for fine-grained retrieval. We propose \textbf{LARE} (Low-Attention Region Encoding), a framework that explicitly models these overlooked regions. LARE adopts a dual-encoding strategy that encodes low-attention regions of an image and the full image in parallel, leading to more diverse and informative image embeddings.
To evaluate retrieval in crowded scenes, we introduce \textbf{Dense-Set}, a challenging subset derived from COCO and Flickr30K in which images are re-captioned to describe low-attention or previously overlooked regions in greater detail. Dense-Set highlights the limitations of existing retrieval models and enables a more rigorous evaluation under densely crowded conditions.
Experimental results demonstrate that the proposed framework improves retrieval performance by preserving subtle, non-dominant visual cues within the shared latent space.
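The dual-encoding idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how low-attention regions are selected, encoded, or scored, so everything below is an assumption made for illustration. In particular, the function names, the `keep_ratio` parameter, the mean-pooling of patch features, and the max-over-embeddings scoring rule are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    # Scale vectors to unit length so dot products act as cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def low_attention_mask(attn, keep_ratio=0.5):
    # Assumption: "low-attention regions" are the keep_ratio fraction of
    # patches with the smallest attention weights.
    k = max(1, int(round(keep_ratio * attn.size)))
    idx = np.argsort(attn)[:k]
    mask = np.zeros(attn.shape, dtype=bool)
    mask[idx] = True
    return mask

def dual_embed(patch_feats, attn, keep_ratio=0.5):
    # Row 0: full-image embedding (attention-weighted pooling, which is
    # dominated by salient patches). Row 1: embedding pooled only over
    # low-attention patches. Both pooling choices are illustrative.
    full = (attn[:, None] * patch_feats).sum(axis=0) / attn.sum()
    low = patch_feats[low_attention_mask(attn, keep_ratio)].mean(axis=0)
    return l2_normalize(np.stack([full, low]))

def score(text_emb, image_embs):
    # Assumption: a text query matches an image if it matches either the
    # full-image or the low-attention embedding (max cosine similarity).
    return float((image_embs @ l2_normalize(text_emb)).max())

# Toy example: two salient patches show one concept, two ignored patches another.
patch_feats = np.array([[1.0, 0.0, 0.0],
                        [1.0, 0.0, 0.0],
                        [0.0, 1.0, 0.0],
                        [0.0, 1.0, 0.0]])
attn = np.array([0.7, 0.2, 0.05, 0.05])
query = np.array([0.0, 1.0, 0.0])  # text describing the non-dominant content

embs = dual_embed(patch_feats, attn)
full_only_score = score(query, embs[:1])
dual_score = score(query, embs)
```

In this toy setting, the query describing the non-dominant content scores poorly against the full-image embedding alone and well once the low-attention embedding is included, which is the kind of effect the abstract attributes to preserving non-dominant visual cues.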
Submission Number: 46