Rethinking Pseudo Word Learning in Zero-Shot Composed Image Retrieval: From an Object-Aware Perspective

Published: 01 Jan 2025 · Last Modified: 16 Jul 2025 · SIGIR 2025 · CC BY-SA 4.0
Abstract: Composed Image Retrieval (CIR) takes a composed query consisting of a reference image and a text describing the user's intention, with the aim of retrieving the target image that satisfies both conditions. Conventional CIR approaches rely heavily on massive annotated triplets, which often come at considerable cost. Zero-Shot CIR (ZS-CIR) offers a new solution that can perform diverse CIR tasks without training on triplet datasets. The key to the ZS-CIR task is to make the specified changes to specific objects in the reference image based on the text. Previous works utilize a projection module to map the reference image into single or multiple pseudo words. However, they are either only applicable to single-object scenarios, or naively convert entire image features into multiple pseudo words and fail to focus on the desired target objects specified by the text description. In this work, we rethink how to learn pseudo words based on the objects attended to by the text and propose a Multi-Object Aware ZS-CIR framework (MOA). Specifically, a multi-object recognizer first recognizes valid objects in the reference image, guided by a set of learnable object queries. We then devise an object filtering strategy that utilizes contextual prompts comprised of noun categories to guide the model in precisely selecting the objects that need to be modified. Finally, the pseudo word learning branch adaptively converts the screened objects into multiple pseudo words for accurate ZS-CIR. Although simple, our MOA consistently outperforms previous state-of-the-art methods across diverse benchmarks and even achieves results competitive with many supervised methods.
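To make the three-stage pipeline in the abstract concrete, below is a minimal, hypothetical PyTorch sketch of how such a framework could be wired together: learnable object queries cross-attend to image patch features (the multi-object recognizer), objects are filtered by cosine similarity to noun-category prompt embeddings (the object filtering strategy), and the retained object features are projected into pseudo word embeddings. All class names, dimensions, the frozen-CLIP assumption, the similarity threshold, and the linear pseudo-word projection are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MOAPseudoWordSketch(nn.Module):
    """Hypothetical sketch of the pipeline described in the abstract:
    (1) learnable object queries attend to image patch features,
    (2) objects are filtered against noun-category prompt embeddings,
    (3) retained objects are converted into pseudo word embeddings."""

    def __init__(self, dim=512, num_queries=8, num_heads=8):
        super().__init__()
        # (1) Multi-object recognizer: learnable queries + cross-attention over patches.
        self.object_queries = nn.Parameter(torch.randn(num_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # (3) Pseudo word learning branch: maps each retained object to a token embedding.
        self.to_pseudo_word = nn.Linear(dim, dim)

    def forward(self, patch_feats, noun_prompt_feats, sim_threshold=0.2):
        """
        patch_feats:       (B, N_patches, dim) image patch features (e.g. from a frozen CLIP ViT).
        noun_prompt_feats: (B, N_nouns, dim) text embeddings of noun-category contextual prompts.
        Returns a list of (K_b, dim) pseudo word embeddings per sample (K_b varies).
        """
        B = patch_feats.size(0)
        queries = self.object_queries.unsqueeze(0).expand(B, -1, -1)
        # (1) Object queries gather object-level evidence from the image.
        obj_feats, _ = self.cross_attn(queries, patch_feats, patch_feats)

        # (2) Object filtering: keep objects whose cosine similarity to any
        #     text-mentioned noun category exceeds a threshold (assumed criterion).
        obj_n = F.normalize(obj_feats, dim=-1)
        noun_n = F.normalize(noun_prompt_feats, dim=-1)
        sim = torch.einsum("bqd,bnd->bqn", obj_n, noun_n).max(dim=-1).values  # (B, Q)

        pseudo_words = []
        for b in range(B):
            keep = sim[b] > sim_threshold
            if not keep.any():                     # fall back to the best-matching object
                keep = sim[b] == sim[b].max()
            # (3) Convert the screened objects into pseudo word embeddings.
            pseudo_words.append(self.to_pseudo_word(obj_feats[b][keep]))
        return pseudo_words


# Usage with random tensors standing in for frozen CLIP outputs.
model = MOAPseudoWordSketch()
patches = torch.randn(2, 49, 512)   # 7x7 patch grid per image
nouns = torch.randn(2, 3, 512)      # three noun-category prompts per query
words = model(patches, nouns)
print([w.shape for w in words])
```

In a full ZS-CIR system, the resulting pseudo word embeddings would be concatenated with the relative caption's tokens and passed through the frozen text encoder to form the composed query; that step is omitted here since the abstract does not specify its details.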