Keywords: VLM Hallucinations, Benchmarking, Reliability, Program Synthesis
Abstract: Vision-Language Models (VLMs) often generate plausible but incorrect responses to image-related queries. Quantifying the effect of such hallucinations requires verifying every claim in a generated free-form response. This can be accomplished by having i) grounded free-form questions that are unambiguously answerable from the provided image, and ii) exhaustive ground-truth scene information against which each generated claim can be verified. In this work, we propose Programmatic VLM Evaluation (PROVE), a new benchmarking paradigm that builds a high-fidelity scene-graph representation from a hyper-detailed image caption. Using a Large Language Model (LLM), we generate diverse and challenging question-answer pairs. Additionally, we synthesize verification programs that take the scene graph as input and can be executed to verify a given response. We thus construct a benchmark of 10k open-ended yet verifiable and visually grounded QA pairs. PROVE enables closed-ended evaluation of VLM responses by comparing scene graphs, thereby circumventing the need to compare free-form language responses. Specifically, we measure a VLM response's helpfulness via its (graph) recall and its truthfulness via its (graph) precision against the scene graph built from the exhaustive ground-truth caption. We extensively benchmark existing VLMs on PROVE and demonstrate its advantages over competing benchmarks.
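The abstract's scoring scheme (helpfulness as graph recall, truthfulness as graph precision) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes scene graphs are flattened into sets of (subject, relation, object) tuples and that claims are matched by exact equality rather than any softer semantic matching the benchmark's verification programs may use; the function name `graph_scores` and the example tuples are hypothetical.

```python
# Minimal sketch (illustrative only): scoring a VLM response's scene graph
# against a ground-truth scene graph, with both graphs represented as sets
# of (subject, relation, object) tuples and matched by exact equality.

def graph_scores(response_graph: set, gt_graph: set) -> tuple[float, float]:
    """Return (precision, recall) of the response graph w.r.t. the ground truth.

    Precision approximates truthfulness (fraction of generated claims that are
    supported by the ground truth); recall approximates helpfulness (fraction
    of ground-truth content that the response covers).
    """
    if not response_graph or not gt_graph:
        return 0.0, 0.0
    supported = response_graph & gt_graph  # claims verified against the scene graph
    precision = len(supported) / len(response_graph)
    recall = len(supported) / len(gt_graph)
    return precision, recall

# Hypothetical example: one correct claim, one hallucinated claim.
gt = {("dog", "on", "sofa"), ("sofa", "color", "red"), ("lamp", "next to", "sofa")}
resp = {("dog", "on", "sofa"), ("sofa", "color", "blue")}
print(graph_scores(resp, gt))  # precision 0.5, recall 0.33
```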
Submission Number: 1