Unblocking Fine-Grained Evaluation of Detailed Captions: An Explaining AutoRater and Critic-and-Revise Pipeline
Keywords: VLM, Image Understanding, Image Captioning, Vision Language Models, Image Benchmark
Abstract: Large Vision-Language Models (VLMs) now generate highly detailed, paragraph-length image captions, yet evaluating their factual accuracy remains challenging. Existing methods, designed for shorter texts or lacking datasets with verified inaccuracies, often miss fine-grained errors. We introduce DOCCI-Critique, a benchmark of 1,400 VLM-generated paragraph captions (100 images, 14 VLMs) with 10,216 sentence-level human annotations of factual correctness and explanatory rationales for errors, all judged within paragraph context. Building on this, we develop VNLI-Critique, a model for automated sentence-level factuality classification and critique generation. We highlight three key applications: (1) VNLI-Critique achieves state-of-the-art results on the external M-HalDetect and CHOCOLATE claim-verification datasets, demonstrating strong generalization. (2) On our benchmark, rankings from the VNLI-Critique-powered DOCCI-Critique AutoRater correlate strongly with human judgments (e.g., 0.98 Spearman for factuality). (3) A novel Critic-and-Revise pipeline, in which VNLI-Critique's critiques guide an LLM to correct caption errors, substantially improves factuality (e.g., a 46% gain on DetailCaps-4870); an illustrative sketch of this loop follows below. Our work offers a crucial benchmark and tools to advance fine-grained evaluation and improvement of VLM-based image understanding.
Supplementary Material: pdf
Primary Area: datasets and benchmarks
Submission Number: 13054
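
Below is a minimal sketch of the Critic-and-Revise loop described in the abstract, under stated assumptions: classify_sentence and revise_caption are hypothetical stand-ins for VNLI-Critique and the revising LLM, not the paper's actual API, and sentence splitting is deliberately naive.

    # Critic-and-Revise sketch: flag factually incorrect sentences in a
    # caption, then hand the critiques to an LLM-based reviser.
    # classify_sentence and revise_caption are hypothetical callables.
    from typing import Callable, List, Tuple

    def critic_and_revise(
        image_id: str,
        caption: str,
        classify_sentence: Callable[[str, str, str], Tuple[bool, str]],
        revise_caption: Callable[[str, List[Tuple[str, str]]], str],
    ) -> str:
        """Classify each sentence in paragraph context; revise flagged ones."""
        # Naive sentence split for illustration only.
        sentences = [s.strip() + "." for s in caption.split(".") if s.strip()]
        critiques: List[Tuple[str, str]] = []
        for sentence in sentences:
            # The critic sees the sentence alongside its full paragraph.
            is_factual, rationale = classify_sentence(image_id, caption, sentence)
            if not is_factual:
                critiques.append((sentence, rationale))
        if not critiques:
            return caption  # nothing to fix
        # The reviser receives the original caption plus sentence-level critiques.
        return revise_caption(caption, critiques)

The design choice worth noting is that the critic operates per sentence while keeping the whole paragraph as context, which matches the benchmark's sentence-level annotations made within paragraph context.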