Trust but Verify: Programmatic VLM Evaluation in the Wild

Viraj Uday Prabhu; Senthil Purushwalkam; An Yan; Caiming Xiong; Ran Xu

13 Sept 2024 (modified: 05 Feb 2025). Submitted to ICLR 2025. License: CC BY 4.0

Keywords: vision-language models, evaluation, hallucinations

TL;DR: Reliable in-the-wild VLM benchmarking via programmatic verification & evaluation

Abstract: Vision-Language Models (VLMs) often generate plausible but incorrect responses to visual queries. However, reliably quantifying the effect of such hallucinations in free-form responses to open-ended queries is challenging, as it requires visually verifying each claim within the response. We propose Programmatic VLM Evaluation (PROVE), a new benchmarking paradigm for evaluating VLM responses to open-ended queries. To construct PROVE, we provide a large language model with a high-fidelity scene-graph representation constructed from a hyper-detailed image caption, and prompt it to generate diverse question-answer (QA) pairs, as well as programs that can be executed over the scene-graph object to _verify_ each QA pair. We thus construct a benchmark of 10.5k challenging but grounded visual QA pairs. Next, to evaluate free-form model responses to queries in PROVE, we propose a _programmatic_ evaluation strategy that measures both the helpfulness and truthfulness of a response within a unified scene-graph-based framework. We benchmark the helpfulness-truthfulness trade-offs of a range of VLMs on PROVE, finding that very few are in fact able to achieve a good balance between the two.
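
The idea of verifying a QA pair by executing a program over a scene graph, and then scoring a response for helpfulness and truthfulness, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `SceneGraph` class, the `verify_qa` and `score_response` helpers, the toy image content, and the two scoring formulas (recall of answer claims for helpfulness, fraction of response claims supported by the graph for truthfulness) are hypothetical and are not taken from the PROVE implementation.

```python
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    """Toy scene graph (hypothetical): entity attributes plus relation triples."""
    attributes: dict = field(default_factory=dict)  # entity -> set of attributes
    relations: set = field(default_factory=set)     # (subject, relation, object)

    def has_attribute(self, entity, attribute):
        return attribute in self.attributes.get(entity, set())

    def has_relation(self, subject, relation, obj):
        return (subject, relation, obj) in self.relations

    def supports(self, claim):
        """A claim is an (entity, attribute) pair or a (subject, relation, object) triple."""
        if len(claim) == 2:
            return self.has_attribute(*claim)
        return self.has_relation(*claim)


def verify_qa(graph, program, expected_answer):
    """Keep a QA pair only if its verification program reproduces the answer."""
    return program(graph) == expected_answer


def score_response(graph, answer_claims, response_claims):
    """Assumed scoring: helpfulness is recall of the ground-truth answer claims,
    truthfulness is the fraction of response claims the graph supports."""
    helpfulness = (
        len(answer_claims & response_claims) / len(answer_claims)
        if answer_claims else 0.0
    )
    supported = sum(1 for claim in response_claims if graph.supports(claim))
    truthfulness = supported / len(response_claims) if response_claims else 1.0
    return helpfulness, truthfulness


# Toy graph for an image of a brown dog holding a red frisbee.
graph = SceneGraph(
    attributes={"dog": {"brown"}, "frisbee": {"red"}},
    relations={("dog", "holding", "frisbee")},
)


def frisbee_color(g):
    """Verification program for: 'What color is the object the dog is holding?'"""
    held = [o for (s, r, o) in g.relations if s == "dog" and r == "holding"]
    colors = sorted(g.attributes.get(held[0], set())) if held else []
    return colors[0] if colors else None


qa_is_valid = verify_qa(graph, frisbee_color, "red")

# A response that answers correctly but also hallucinates the dog's color.
helpfulness, truthfulness = score_response(
    graph,
    answer_claims={("frisbee", "red")},
    response_claims={("frisbee", "red"), ("dog", "black")},
)
```

In this toy case the response recovers the full answer but only half of its claims are supported by the graph, which is the kind of helpfulness-truthfulness trade-off the abstract describes.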

Primary Area: datasets and benchmarks

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Submission Number: 559
