Unblocking Fine-Grained Evaluation of Detailed Captions: An Explaining AutoRater and Critic-and-Revise Pipeline
Keywords: VLM, Image Understanding, Image Captioning, Vision Language Models, Image Benchmark
Abstract: Large Vision-Language Models (VLMs) now generate highly detailed, paragraph-length image captions, yet evaluating their factual accuracy remains challenging. Existing methods, designed for shorter texts or lacking datasets with verified inaccuracies, often miss fine-grained errors. We introduce DOCCI-Critique, a benchmark of 1,400 VLM-generated paragraph captions (100 images, 14 VLMs) with 10,216 sentence-level human annotations of factual correctness and explanatory rationales for errors, all judged within paragraph context. Building on this, we develop VNLI-Critique, a model for automated sentence-level factuality classification and critique generation. We highlight three key applications: (1) VNLI-Critique achieves state-of-the-art results on the external M-HalDetect and CHOCOLATE claim-verification datasets, demonstrating strong generalization. (2) On our benchmark, rankings from the VNLI-Critique-powered DOCCI-Critique AutoRater correlate strongly with human judgments (e.g., 0.98 Spearman for factuality). (3) A novel Critic-and-Revise pipeline, in which VNLI-Critique's critiques guide an LLM to correct caption errors, substantially improves factuality (e.g., a 46% gain on DetailCaps-4870); an illustrative sketch of this loop follows below. Our work offers a crucial benchmark and tools to advance fine-grained evaluation and improvement of VLM-based image understanding.
Supplementary Material: pdf
Primary Area: datasets and benchmarks
Submission Number: 13054
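
Below is a minimal sketch of the Critic-and-Revise loop described in the abstract, under stated assumptions: classify_sentence and revise_caption are hypothetical stand-ins for VNLI-Critique and the revising LLM, not the paper's actual API, and sentence splitting is deliberately naive.

    # Critic-and-Revise sketch: flag factually incorrect sentences in a
    # caption, then hand the critiques to an LLM-based reviser.
    # classify_sentence and revise_caption are hypothetical callables.
    from typing import Callable, List, Tuple

    def critic_and_revise(
        image_id: str,
        caption: str,
        classify_sentence: Callable[[str, str, str], Tuple[bool, str]],
        revise_caption: Callable[[str, List[Tuple[str, str]]], str],
    ) -> str:
        """Classify each sentence in paragraph context; revise flagged ones."""
        # Naive sentence split for illustration only.
        sentences = [s.strip() + "." for s in caption.split(".") if s.strip()]
        critiques: List[Tuple[str, str]] = []
        for sentence in sentences:
            # The critic sees the sentence alongside its full paragraph.
            is_factual, rationale = classify_sentence(image_id, caption, sentence)
            if not is_factual:
                critiques.append((sentence, rationale))
        if not critiques:
            return caption  # nothing to fix
        # The reviser receives the original caption plus sentence-level critiques.
        return revise_caption(caption, critiques)

The design choice worth noting is that the critic operates per sentence while keeping the whole paragraph as context, which matches the benchmark's sentence-level annotations made within paragraph context.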