Abstract: Vision-language models (VLMs) are often evaluated on linguistic understanding, such as verb recognition or object counting, using handcrafted datasets with contrastive image-caption pairs. However, these datasets rarely capture the full complexity of real-world language use. We propose a probing framework based on post-retrieval analysis, which evaluates a model's top-K retrievals and reveals finer-grained weaknesses in model behavior. We evaluate four VLMs (CLIP, BLIP-2, FLAVA, and SigLIP2) on two datasets: SVO-Probes (probing subject-verb-object role understanding) and VALSE-counting (probing numerical comprehension). To mitigate the issue of incomplete retrieval-dataset annotations, we complement traditional metrics with three strategies: semantic-similarity success@K, human evaluation, and GPT-4o-based assessment. Our findings show that while VLMs achieve high image-text matching accuracy ($>80\%$), they struggle in top-K retrieval settings, both for verb and object understanding (success@1 $\approx 70\%$) and, more severely, for counting (success@1 $\approx 35\%$). Furthermore, GPT-4o aligns moderately with human judgment on the verb task but fails on the counting task. We conclude that standard evaluation methods may underestimate VLM capabilities, and that post-retrieval probing offers a more robust and nuanced view of their linguistic understanding.
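To illustrate the semantic-similarity success@K metric mentioned in the abstract, the snippet below is a minimal sketch, assuming a sentence-embedding backbone (here the hypothetical choice of `all-MiniLM-L6-v2` from sentence-transformers) and an illustrative similarity threshold; the paper's actual encoder, threshold, and aggregation details are not specified here.

```python
# Minimal sketch of semantic-similarity success@K.
# The encoder and threshold are illustrative assumptions, not the paper's exact setup.
import numpy as np
from sentence_transformers import SentenceTransformer

_encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def success_at_k(gold_caption: str,
                 retrieved_captions: list[str],
                 k: int = 5,
                 threshold: float = 0.8) -> bool:
    """Count a query as a success if any of the top-K retrieved captions
    is semantically close enough to the gold caption."""
    texts = [gold_caption] + retrieved_captions[:k]
    embs = _encoder.encode(texts, normalize_embeddings=True)
    sims = embs[1:] @ embs[0]  # cosine similarity via normalized dot product
    return bool(np.any(sims >= threshold))

# Aggregation over an evaluation set (eval_pairs is a hypothetical list of
# (gold_caption, ranked_retrievals) tuples):
# scores = [success_at_k(gold, ranked, k=1) for gold, ranked in eval_pairs]
# print("success@1:", np.mean(scores))
```

The threshold controls how strictly a retrieved caption must match the gold annotation; in practice it would presumably be calibrated against human or GPT-4o judgments, as the abstract's complementary evaluation strategies suggest.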
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: image-language models, probing, understanding, verb understanding, counting comprehension
Contribution Types: Model analysis & interpretability, Data analysis
Languages Studied: English
Submission Number: 6865