CAPability: A Comprehensive Visual Caption Benchmark for Evaluating Both Correctness and Thoroughness
Keywords: visual caption, benchmark, MLLM
TL;DR: A new benchmark that evaluates visual captions from MLLMs, considering both correctness and thoroughness.
Abstract: Visual captioning benchmarks have become outdated with the emergence of modern multimodal large language models (MLLMs), as their brief ground-truth sentences and traditional metrics fail to assess detailed captions effectively. While recent benchmarks attempt to address this by focusing on keyword extraction or object-centric evaluation, they remain limited to vague-view or object-view analyses and incomplete visual element coverage. In this paper, we introduce CAPability, a comprehensive multi-view benchmark for evaluating visual captioning across 12 dimensions spanning six critical views. We curate nearly 11K human-annotated images and videos with visual element annotations to evaluate the generated captions. CAPability stably assesses both the correctness and thoroughness of captions with \textit{precision} and \textit{hit} metrics. By converting annotations to QA pairs, we further introduce a heuristic metric, \textit{know but cannot tell} ($K\bar{T}$), which reveals a significant gap between models' QA and captioning capabilities. Our work provides a holistic analysis of MLLMs' captioning abilities, identifying their strengths and weaknesses across dimensions and guiding future research on enhancing specific aspects of these capabilities.
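The abstract names the \textit{precision}, \textit{hit}, and $K\bar{T}$ metrics without giving their formulas. Below is a minimal Python sketch of one plausible reading, assuming per-element judgments of whether a caption states each annotated visual element correctly and whether the model answers the corresponding QA pair; the `ElementResult` structure, the `score_dimension` function, and the exact metric definitions are illustrative assumptions, not the paper's released evaluation code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ElementResult:
    """Judged outcome for one annotated visual element (hypothetical structure).

    caption_states: True if the caption describes the element correctly,
                    False if it describes it wrongly, None if not mentioned.
    qa_correct:     whether the model answers the element's QA pair correctly.
    """
    caption_states: Optional[bool]
    qa_correct: bool


def score_dimension(results: List[ElementResult]) -> dict:
    """Aggregate precision, hit, and know-but-cannot-tell (KT-bar) for one
    dimension, under the assumed definitions noted in the comments."""
    mentioned = [r for r in results if r.caption_states is not None]
    # Assumed precision: of the elements the caption does describe,
    # the fraction it describes correctly.
    precision = (
        sum(r.caption_states for r in mentioned) / len(mentioned) if mentioned else 0.0
    )
    # Assumed hit: of all annotated elements, the fraction the caption
    # describes correctly (unmentioned elements count as misses).
    hit = sum(bool(r.caption_states) for r in results) / len(results)
    # Assumed KT-bar: among elements the model answers correctly in QA form,
    # the fraction it fails to state correctly in the caption.
    qa_right = [r for r in results if r.qa_correct]
    kt_bar = (
        sum(not r.caption_states for r in qa_right) / len(qa_right) if qa_right else 0.0
    )
    return {"precision": precision, "hit": hit, "kt_bar": kt_bar}
```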
Croissant File: json
Dataset URL: https://huggingface.co/datasets/lntzm/CAPability
Code URL: https://github.com/ali-vilab/CAPability
Supplementary Material: pdf
Primary Area: Datasets & Benchmarks for applications in language modeling and vision language modeling
Submission Number: 351