Abstract: Zero-shot Cross-Modal Retrieval (ZS-CMR) is challenging because data distributions are heterogeneous across modalities and semantics are inconsistent between seen and unseen classes. Previous methods usually perform class-level semantic alignment of data from different modalities by introducing auxiliary word embeddings of class labels. This approach has a serious limitation: learning only class-level information neglects intra-modal variance. To solve this problem, we propose Instance-Level Semantic Alignment (ILSA), a method that makes full use of instance-level information. We use two disentangling variational auto-encoders to decompose the data from the two modalities into modality-specific and modality-invariant features. With an instance-level semantic feature extractor and a distribution generator, ILSA generates more appropriate distributions from the learned instance-level semantic features, without any auxiliary knowledge. We conduct experiments on six widely used datasets under two ZS-CMR scenarios; the results show that our method establishes new state-of-the-art performance on all datasets.