Abstract: Compositional zero-shot learning (CZSL) aims to recognize images from unseen compositional classes, each consisting of a state and an object concept that individually appear in seen compositional images. The key challenge of CZSL is to effectively mitigate the contextuality issue so as to achieve compositional transfer from seen classes to unseen ones: the visual appearance of the same state varies when it is combined with different objects. To address this dilemma, we propose a swap-reconstruction autoencoder (SRA) that captures the intrinsic context of ambiguous states. Specifically, SRA learns a consistent embedding space for multi-modal data, and a swap-reconstruction mechanism is designed to disentangle the visual embeddings of states and objects. A loss comprising a superclass-oriented state swap-reconstruction term and an object swap-reconstruction term models the contextual relationship between states and objects. Extensive experiments demonstrate that SRA outperforms current state-of-the-art methods on three benchmark datasets.
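To make the swap-reconstruction idea concrete, the following is a minimal, self-contained sketch (not the paper's actual architecture): a linear encoder splits an image feature into a state embedding and an object embedding, and for two images assumed to share the same object, the object embeddings are swapped before decoding; the reconstruction error then penalizes object information that leaks into the state embedding. All dimensions, weights, and function names here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 8, 4  # hypothetical feature dim and per-concept embedding dim

# Hypothetical linear encoder/decoder weights; a real SRA would learn these.
W_state = rng.normal(size=(D, H))
W_obj = rng.normal(size=(D, H))
W_dec = rng.normal(size=(2 * H, D))

def encode(x):
    """Split an image feature into a state and an object embedding."""
    return x @ W_state, x @ W_obj

def decode(s, o):
    """Reconstruct the image feature from concatenated concept embeddings."""
    return np.concatenate([s, o], axis=-1) @ W_dec

def swap_recon_loss(x1, x2):
    """Object swap-reconstruction: given two images assumed to share the
    same object, swap their object embeddings, decode, and measure the
    mean squared reconstruction error."""
    s1, o1 = encode(x1)
    s2, o2 = encode(x2)
    # Each image keeps its own state embedding but takes the other
    # image's object embedding; if the embeddings are disentangled,
    # reconstruction should still succeed.
    r1 = decode(s1, o2)
    r2 = decode(s2, o1)
    return float(np.mean((r1 - x1) ** 2) + np.mean((r2 - x2) ** 2))

x1, x2 = rng.normal(size=D), rng.normal(size=D)
loss = swap_recon_loss(x1, x2)
```

The state swap-reconstruction term would be symmetric: swap state embeddings between images that share the same state. In training, this loss would be minimized jointly with classification objectives so that each embedding retains only its own concept's information.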