Predicting object identity and visual attributes is a fundamental task in many computer vision applications. While large vision-language models such as CLIP have largely solved the task of zero-shot object recognition, zero-shot visual attribute recognition remains challenging because CLIP's contrastively learned language-vision representation does not effectively encode object-attribute dependencies. In this paper, we revisit the problem of attribute recognition and propose a solution using generative prompting, which reformulates attribute recognition as measuring the probability of generating a prompt that expresses the attribute relation. Unlike contrastive prompting, generative prompting is order-sensitive and designed specifically for downstream object-attribute decomposition. We demonstrate through experiments that generative prompting consistently outperforms contrastive prompting on two visual reasoning datasets: Visual Attributes in the Wild (VAW) and a proposed modified formulation of Visual Genome, which we call Visual Genome Attribute Ranking (VGAR).
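The scoring idea behind generative prompting can be illustrated with a minimal sketch. It assumes an image-conditioned text decoder that exposes next-token probabilities; here that decoder is replaced by a hypothetical toy function, `toy_next_token_probs`, and the prompt template, vocabulary, and "image" dictionary are likewise illustrative assumptions rather than the setup used in the paper. Each candidate attribute is scored by the log-probability of generating a prompt that mentions it, and candidates are ranked by that score.

```python
import math

# Hypothetical stand-in for an image-conditioned text decoder (assumption:
# a real system would query a pretrained captioning model instead).
def toy_next_token_probs(image, prefix):
    """Return p(next token | image, prefix) as a dict over a tiny vocabulary."""
    vocab = ["the", "cat", "is", "black", "white", "fluffy", "<eos>"]
    scores = {tok: 1.0 for tok in vocab}
    # Right after "is", favor attributes that the toy "image" contains.
    if prefix and prefix[-1] == "is":
        for attr, strength in image["attributes"].items():
            scores[attr] += strength
    total = sum(scores.values())
    return {tok: s / total for tok, s in scores.items()}


def prompt_log_prob(image, tokens, next_token_probs=toy_next_token_probs):
    """Sum of log p(token_t | image, tokens_<t) over the whole prompt."""
    log_p = 0.0
    for t, tok in enumerate(tokens):
        probs = next_token_probs(image, tokens[:t])
        log_p += math.log(probs[tok])
    return log_p


def rank_attributes(image, obj, candidates, template="the {obj} is {attr}"):
    """Rank candidate attributes by the log-probability of their prompt."""
    scored = []
    for attr in candidates:
        tokens = template.format(obj=obj, attr=attr).split()
        scored.append((attr, prompt_log_prob(image, tokens)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy "image": attribute strengths play the role of visual evidence.
image = {"attributes": {"black": 8.0, "fluffy": 3.0}}
ranking = rank_attributes(image, "cat", ["white", "black", "fluffy"])
```

Because the score is a product of conditionals over the token sequence, the attribute token is conditioned on whatever precedes it in the template. Placing the object before the attribute, as in this illustrative template, is one way such a score can depend on word order.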