Multi-Modal Primitive Retrieval for Compositional Zero-Shot Learning

Chenchen Jing, Haozhe Zhang, Junbo Lu, Yang Liu, Hao Chen, Xiaoqin Zhang, Chunhua Shen

Published: 20 Feb 2026, Last Modified: 31 May 2026. Venue: International Journal of Computer Vision (IJCV). License: CC BY-NC-ND 4.0

Abstract: Compositional generalization, the ability to understand unseen combinations of seen primitives, is one of the fundamental properties of human intelligence. To evaluate this ability in vision models, compositional zero-shot learning (CZSL) requires recognizing unseen attribute-object compositions after learning from seen compositions. It is essential for CZSL to compose the learned knowledge of seen primitives, i.e., attributes and objects, into novel compositions. In this work, we propose a retrieval-augmented method for CZSL that explicitly retrieves knowledge of seen primitives from both the vision and language domains. Our method augments standard multi-path classification methods with retrieval modules. Specifically, we first construct several databases that store abundant and diverse primitive knowledge: the attribute and object representations of training images, and the textual representations of attribute and object descriptions. For an input training or testing image, visual and textual retrieval modules retrieve the representations of relevant training images and text descriptions that share the same attribute or object. The retrieved representations are then used to augment the primitive representations and the image representation of the input image for composition recognition. By referencing semantically similar images and texts, the proposed method can recall knowledge of seen primitives for compositional generalization. Experiments on three widely used datasets show the effectiveness of the proposed method.
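
The retrieve-then-augment step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the similarity measure, the number of retrieved entries, or the fusion rule, so cosine similarity, top-k selection, softmax weighting, and the names `retrieve_top_k`, `augment`, `temperature`, `alpha`, and `toy_db` are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis` (eps avoids division by zero)."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def retrieve_top_k(query, database, k):
    """Return indices and cosine similarities of the k database rows nearest to `query`.

    Assumption: cosine similarity is used as the retrieval score.
    """
    sims = l2_normalize(database) @ l2_normalize(query)
    k = min(k, database.shape[0])
    top = np.argsort(-sims)[:k]  # indices sorted by descending similarity
    return top, sims[top]


def augment(query, database, k=2, temperature=0.1, alpha=0.5):
    """Blend `query` with a softmax-weighted average of its k retrieved neighbours.

    Assumption: a convex combination controlled by the hypothetical `alpha`
    stands in for whatever fusion module the paper actually uses.
    """
    idx, sims = retrieve_top_k(query, database, k)
    logits = sims / temperature
    weights = np.exp(logits - logits.max())  # numerically stable softmax
    weights /= weights.sum()
    retrieved = weights @ database[idx]
    return (1.0 - alpha) * query + alpha * retrieved, idx


# Toy "attribute database": each row is a hypothetical stored representation.
toy_db = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [0.9, 0.1],
    [-1.0, 0.0],
])
toy_query = np.array([1.0, 0.0])
augmented, neighbours = augment(toy_query, toy_db, k=2)
print(neighbours, augmented)
```

In a fuller pipeline the same routine would presumably be applied separately to the attribute, object, and text databases, with the augmented representations passed on to the multi-path classifiers; that wiring is left out here because the abstract does not describe it.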