Abstract: Language-vision models like CLIP have made significant progress on zero-shot vision tasks such as zero-shot image classification (ZSIC). However, generating specific and expressive visual descriptions remains a challenge, as current methods produce descriptions that lack granularity and are often ambiguous. To address these challenges, we propose V-GLOSS: Visual Glosses, a novel method that prompts language models with semantic knowledge to produce improved visual descriptions. We demonstrate that V-GLOSS achieves state-of-the-art results on benchmark ZSIC datasets such as ImageNet and STL-10. In addition, we introduce a silver dataset of visual descriptions generated by V-GLOSS and demonstrate its utility for language-vision tasks.
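The sketch below illustrates, under stated assumptions, how language-model-generated class descriptions could be plugged into a CLIP-style zero-shot classifier; it is not the authors' released implementation, and the gloss_for_class helper and its prompt template are hypothetical stand-ins for the V-GLOSS prompting step.

# Minimal sketch: using LM-generated class descriptions ("glosses") with CLIP
# for zero-shot image classification. gloss_for_class is a hypothetical placeholder.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def gloss_for_class(class_name: str) -> str:
    # Placeholder for prompting a language model with semantic knowledge
    # (e.g., a WordNet-style definition) to obtain a specific visual description.
    return f"a photo of a {class_name}, with its distinctive visual features clearly visible"

class_names = ["tabby cat", "golden retriever"]
glosses = [gloss_for_class(c) for c in class_names]

image = Image.open("example.jpg")  # assumed example input
inputs = processor(text=glosses, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the image-gloss similarities; the argmax is the predicted class.
pred = outputs.logits_per_image.softmax(dim=-1).argmax(dim=-1)
print(class_names[pred.item()])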
Paper Type: long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Languages Studied: English