Abstract: Do vision-language models (VLMs) pre-trained to caption an image of a durian learn visual
concepts such as brown (color) and spiky (texture) at the same time? We aim to answer
this question as visual concepts learned “for free” would enable wide applications such as
neuro-symbolic reasoning or human-interpretable object classification. We assume that the
visual concepts, if captured by pre-trained VLMs, can be extracted by their vision-language
interface with text-based concept prompts. We observe that recent works prompting VLMs
with concepts often differ in their strategies to define and evaluate the visual concepts,
leading to conflicting conclusions. We propose a new concept definition strategy based on
two observations: First, certain concept prompts include shortcuts that recognize correct
concepts for wrong reasons; Second, multimodal information (e.g., visual discriminativeness
and textual knowledge) should be leveraged when selecting the concepts. Our proposed
concept discovery and learning (CDL) framework is thus designed to identify a diverse list
of generic visual concepts (e.g. spiky as opposed to spiky durian), which are ranked and
selected based on visual and language mutual information. We carefully design quantitative
and human evaluations of the discovered concepts on nine diverse visual recognition datasets,
which confirm that pre-trained VLMs do learn visual concepts that provide accurate and
thorough descriptions for the recognized objects. All code and models are publicly released.
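The abstract describes probing a pre-trained VLM through its vision-language interface with text-based concept prompts. The snippet below is a minimal sketch of that idea only, assuming a CLIP-style model; the concept list, the prompt template, and the image path are illustrative placeholders, and the mutual-information ranking used by the CDL framework is not reproduced here.

```python
# Minimal sketch (not the paper's CDL implementation): score an image against
# generic concept prompts through a CLIP-style vision-language interface.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Generic (object-agnostic) concepts, e.g. "spiky" rather than "spiky durian".
concepts = ["brown", "spiky", "smooth", "striped", "metallic"]
prompts = [f"a photo of something {c}" for c in concepts]  # hypothetical template

image = Image.open("durian.jpg")  # placeholder image path
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(**inputs).logits_per_image.squeeze(0)  # image-to-text similarity

# Rank concepts by similarity; CDL additionally selects concepts using visual and
# language mutual information, which is not shown in this sketch.
for score, concept in sorted(zip(logits.tolist(), concepts), reverse=True):
    print(f"{concept}: {score:.2f}")
```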
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~David_Fouhey2
Submission Number: 3332