One Patch, One Text: Sparse Alignment for Closing CLIP's Modality Gap for Compositional Zero-Shot Learning
Keywords: compositional zero-shot learning; zero-shot learning
Abstract: Compositional Zero-Shot Learning (CZSL) aims to recognize unseen attribute-object compositions using primitive ($i.e.$, attribute and object) knowledge learned from seen compositions. Previous methods achieve remarkable results by leveraging the powerful cross-modal alignment capabilities of CLIP. However, they largely ignore the inherent limitations arising from information-imbalanced image-text training data, notably the modality gap. In this work, we propose SAC, a novel $\underline{S}\text{parse}$ $\underline{A}\text{lignment}$ framework to effectively $\underline{C}\text{lose}$ CLIP's modality gap for CZSL. Specifically, we conduct $\textbf{\textit{sparse alignment}}$ between textual representations and their semantically relevant visual patches, which reduces redundant visual information and mitigates the information imbalance within image-text pairs. Subsequently, guided by the reduced visual information from this alignment, a $\textbf{\textit{visual adaptive condensation}}$ module adaptively condenses critical visual cues into a unified representation. Finally, we introduce a $\textbf{\textit{dynamically updated memory bank}}$ that stores samples from both seen and unseen compositions (drawn from historical test data). This design bypasses the modality gap by relying solely on visual classification, while simultaneously improving generalization to unseen compositions. Experiments on three benchmarks demonstrate that our method achieves significant improvements over a strong CLIP-based method under both closed-world and open-world settings.
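To make the sparse-alignment idea concrete, below is a minimal PyTorch sketch of aligning a composition text embedding with only its most relevant visual patches and condensing them into a single representation. The top-k selection rule, temperature, and all function and variable names are illustrative assumptions for exposition, not the authors' actual SAC implementation.

```python
# Sketch: sparse patch-text alignment + weighted condensation (assumed design).
import torch
import torch.nn.functional as F


def sparse_align_and_condense(patch_feats: torch.Tensor,
                              text_feat: torch.Tensor,
                              k: int = 8,
                              temperature: float = 0.07):
    """
    patch_feats: [N, D] visual patch embeddings from the CLIP image encoder.
    text_feat:   [D]    composition text embedding from the CLIP text encoder.
    Returns a condensed visual vector [D] and an alignment loss scalar.
    """
    # Normalize so dot products are cosine similarities.
    patches = F.normalize(patch_feats, dim=-1)
    text = F.normalize(text_feat, dim=-1)

    # Score each patch against the text and keep only the top-k most
    # semantically relevant patches (the "sparse" part: redundant patches
    # are dropped instead of being averaged into the image representation).
    sims = patches @ text                       # [N]
    topk_sims, topk_idx = sims.topk(k)          # [k], [k]
    selected = patches[topk_idx]                # [k, D]

    # Adaptive condensation: softmax-weighted pooling of the selected
    # patches into one unified visual representation.
    weights = F.softmax(topk_sims / temperature, dim=0)    # [k]
    condensed = (weights.unsqueeze(-1) * selected).sum(0)  # [D]
    condensed = F.normalize(condensed, dim=-1)

    # Alignment loss: pull the condensed visual vector toward the text.
    align_loss = 1.0 - condensed @ text
    return condensed, align_loss


# Toy usage with random features (ViT-B/32-style dimensions).
patches = torch.randn(49, 512)   # 7x7 patch grid
text = torch.randn(512)
vis, loss = sparse_align_and_condense(patches, text)
print(vis.shape, loss.item())
```

Because only the top-k patches contribute, the text is never forced to match background or otherwise redundant regions, which is one plausible way to reduce the image-text information imbalance described in the abstract.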
Supplementary Material: pdf
Primary Area: transfer learning, meta learning, and lifelong learning
Submission Number: 4975