Abstract: Vision-language models (VLMs) are often evaluated on linguistic understanding, such as verb recognition or object counting, using handcrafted datasets with contrastive image-caption pairs. However, these datasets rarely capture the full complexity of real-world language use. We propose a probing framework based on post-retrieval analysis, which evaluates a model's top-K retrievals and reveals finer-grained weaknesses in model behavior. We evaluate four VLMs (CLIP, BLIP-2, FLAVA, and SigLIP2) on two datasets: SVO-Probes (probing subject-verb-object role understanding) and VALSE-counting (probing numerical comprehension). To mitigate the issue of incomplete retrieval-dataset annotations, we complement traditional metrics with three strategies: semantic-similarity success@K, human evaluation, and GPT-4o-based assessment. Our findings show that while VLMs achieve high image-text matching accuracy ($>80\%$), they struggle in top-K retrieval settings, both for verb and object understanding (success@1 $\approx 70\%$) and, more severely, for counting (success@1 $\approx 35\%$). Furthermore, GPT-4o aligns moderately with human judgment on the verb task but fails on the counting task. We conclude that standard evaluation methods may underestimate VLM capabilities, and that post-retrieval probing offers a more robust and nuanced view of their linguistic understanding.
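To illustrate the semantic-similarity success@K metric mentioned in the abstract, the snippet below is a minimal sketch, assuming a sentence-embedding backbone (here the hypothetical choice of `all-MiniLM-L6-v2` from sentence-transformers) and an illustrative similarity threshold; the paper's actual encoder, threshold, and aggregation details are not specified here.

```python
# Minimal sketch of semantic-similarity success@K.
# The encoder and threshold are illustrative assumptions, not the paper's exact setup.
import numpy as np
from sentence_transformers import SentenceTransformer

_encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def success_at_k(gold_caption: str,
                 retrieved_captions: list[str],
                 k: int = 5,
                 threshold: float = 0.8) -> bool:
    """Count a query as a success if any of the top-K retrieved captions
    is semantically close enough to the gold caption."""
    texts = [gold_caption] + retrieved_captions[:k]
    embs = _encoder.encode(texts, normalize_embeddings=True)
    sims = embs[1:] @ embs[0]  # cosine similarity via normalized dot product
    return bool(np.any(sims >= threshold))

# Aggregation over an evaluation set (eval_pairs is a hypothetical list of
# (gold_caption, ranked_retrievals) tuples):
# scores = [success_at_k(gold, ranked, k=1) for gold, ranked in eval_pairs]
# print("success@1:", np.mean(scores))
```

The threshold controls how strictly a retrieved caption must match the gold annotation; in practice it would presumably be calibrated against human or GPT-4o judgments, as the abstract's complementary evaluation strategies suggest.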
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: image-language models, probing, understanding, verb understanding, counting comprehension
Contribution Types: Model analysis & interpretability, Data analysis
Languages Studied: English
Submission Number: 6865