Slot Inversion for Asymmetric Composed Image Retrieval

Published: 2025, Last Modified: 22 Jan 2026, Venue: ICME 2025, License: CC BY-SA 4.0
Abstract: Composed Image Retrieval (CIR) is a challenging vision-language (VL) task that retrieves target images using multi-modal (image+text) queries. Although existing CIR approaches have made significant progress, deploying them in resource-constrained scenarios remains problematic. To address this issue, we propose a novel framework, Slot Inversion for Asymmetric Composed Image Retrieval (Slot4ACir), which adopts an asymmetric retrieval scheme: lightweight models are employed on the query side, while large-scale VL models operate on the gallery side. Specifically, we introduce a lightweight inversion module based on slot attention, which maps an image into multiple textual tokens with distinct semantics. Additionally, we propose an LLM sampler to facilitate richer semantic interactions, and a distillation alignment (DTA) loss that enables the extraction of more informative representations. Extensive experiments on two popular benchmarks demonstrate the effectiveness of our approach. The code is available at https://github.com/JThuge/Slot4ACir.
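As a rough illustration of the two ideas named in the abstract, the sketch below shows (1) a heavily simplified slot attention loop that groups image patch features into a small number of slot vectors, and (2) cosine-similarity ranking of gallery embeddings against a query embedding. This is not the Slot4ACir implementation: the learned projections, GRU, and MLP of standard slot attention are omitted, the mapping from slots to textual tokens is not shown, and the names `slot_attention` and `rank_gallery` are hypothetical. The linked repository remains the reference for the actual method.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def slot_attention(features, num_slots=4, iters=3, seed=0, eps=1e-8):
    """Simplified slot attention (assumption: no learned weights, GRU, or MLP).

    features: list of N patch vectors, each of length D.
    Returns (slots, attn): slots is K x D; attn is N x K and holds the
    attention weights computed in the final iteration.
    """
    rng = random.Random(seed)
    dim = len(features[0])
    scale = 1.0 / math.sqrt(dim)
    # Randomly initialised slots, refined iteratively below.
    slots = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_slots)]
    attn = []
    for _ in range(iters):
        # Softmax over the slot axis: slots compete for each patch.
        attn = [softmax([scale * dot(f, s) for s in slots]) for f in features]
        new_slots = []
        for k in range(num_slots):
            weights = [row[k] for row in attn]
            total = sum(weights) + eps
            # Each slot becomes the attention-weighted mean of the patches.
            new_slots.append([
                sum(w * f[d] for w, f in zip(weights, features)) / total
                for d in range(dim)
            ])
        slots = new_slots
    return slots, attn


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)) + 1e-12)


def rank_gallery(query_vec, gallery_vecs):
    """Return gallery indices sorted by descending cosine similarity."""
    scores = [cosine(query_vec, g) for g in gallery_vecs]
    return sorted(range(len(gallery_vecs)), key=lambda i: scores[i], reverse=True)
```

In an asymmetric setup of the kind the abstract describes, `query_vec` would come from the lightweight query-side model and `gallery_vecs` would be precomputed by the large gallery-side VL model; here both are plain lists of floats standing in for those embeddings.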