Visual primitives as words: Alignment and interaction for compositional zero-shot learning

Feng Shuang, Jiahuan Li, Qingbao Huang, Wenye Zhao, Dongsheng Xu, Chao Han, Haonan Cheng

Published: 2025, Last Modified: 11 Apr 2025. Pattern Recognit. 2025. CC BY-SA 4.0

Abstract: Highlights
• We present a novel perspective that a visual primitive can be regarded as a word.
• We propose VisPrompt for interacting visual primitives with sub-concepts in a prompt.
• VisPrompt can easily align visual elements with text and interact vision with text.
• We propose multi-level visual prompts to achieve fine-grained alignment.