Keywords: fine-grained evaluation, vlm-as-a-judge, vision-language model
TL;DR: We train the first vision-language model specifically designed for fine-grained evaluation; it performs on par with GPT-4V as an evaluator.
Abstract: Assessing long-form responses generated by Vision-Language Models (VLMs) is challenging. It requires not only checking whether the VLM follows the given instruction but also verifying whether the text output is properly grounded in the given image. Inspired by the recent approach of evaluating LMs with LMs, in this work, we propose to evaluate VLMs with VLMs. For this purpose, we present a new multi-modal feedback dataset called the Perception Collection, encompassing 15K customized score rubrics that users might care about during assessment. Using the Perception Collection, we train Prometheus-Vision, the first open-source VLM specialized for fine-grained evaluation purposes. Prometheus-Vision shows the highest Pearson correlation with human evaluators and GPT-4V among the open-source VLM baselines, demonstrating its effectiveness as a transparent and accessible evaluator. We open-source our code, dataset, and model at https://anonymous.4open.science/r/prometheus-vision-9D37.
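To make the VLM-as-a-judge setup concrete, here is a minimal sketch of rubric-based evaluation with a judge VLM. This is not the authors' released code: the model identifier `JUDGE_MODEL_ID`, the prompt template, and the `build_prompt`/`judge` helpers are illustrative assumptions, and the exact prompt format of Prometheus-Vision may differ.

```python
# Minimal sketch of fine-grained "VLM-as-a-judge" evaluation.
# Assumptions: a LLaVA-style judge checkpoint on the Hugging Face Hub
# (JUDGE_MODEL_ID is a placeholder) and an illustrative rubric template.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

JUDGE_MODEL_ID = "org/judge-vlm-checkpoint"  # placeholder: substitute the released model

# Illustrative template: the judge sees the image, the instruction given to the
# evaluated VLM, that VLM's response, and a customized score rubric, then emits
# free-form feedback followed by an integer score.
RUBRIC_TEMPLATE = (
    "<image>\n"
    "###Task Description: Assess whether the response follows the instruction "
    "and is grounded in the image. Write feedback, then a score from 1 to 5.\n"
    "###Instruction: {instruction}\n"
    "###Response: {response}\n"
    "###Score Rubric: {rubric}\n"
    "###Feedback:"
)

def build_prompt(instruction: str, response: str, rubric: str) -> str:
    """Fill the rubric-based evaluation template."""
    return RUBRIC_TEMPLATE.format(
        instruction=instruction, response=response, rubric=rubric
    )

def judge(image_path: str, instruction: str, response: str, rubric: str) -> str:
    """Run the judge VLM and return its feedback + score as text."""
    processor = AutoProcessor.from_pretrained(JUDGE_MODEL_ID)
    model = LlavaForConditionalGeneration.from_pretrained(
        JUDGE_MODEL_ID, torch_dtype=torch.float16, device_map="auto"
    )
    prompt = build_prompt(instruction, response, rubric)
    inputs = processor(
        text=prompt, images=Image.open(image_path), return_tensors="pt"
    ).to(model.device)
    out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    # The decoded string echoes the prompt; the feedback and score follow it.
    return processor.decode(out[0], skip_special_tokens=True)
```

In this setup, the rubric is supplied per query rather than fixed, which is what makes the evaluation "fine-grained": the same judge can score, say, spatial accuracy for one instruction and stylistic fidelity for another.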
Submission Number: 16