Abstract: Large-scale Vision-Language Models (LVLMs) can process both images and text, demonstrating advanced capabilities in multimodal tasks like image captioning and visual question answering (VQA).
However, it remains unclear whether they can understand and evaluate images, particularly when it comes to capturing the nuanced impressions and evaluations that such images convey.
To address this, we propose an image review evaluation method using rank correlation analysis.
Our method asks a model to rank five review texts written for an image.
We then measure the correlation between the model's ranking and a human-annotated ranking.
This enables effective evaluation of review texts, for which there is no single correct answer.
We validate this approach with a benchmark dataset of images spanning 15 categories, where each image is paired with five review texts and annotated rankings in both English and Japanese, yielding over 2,000 data instances.
Our experiments show that LVLMs excel at distinguishing between high-quality and low-quality reviews.
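As a concrete illustration of the rank-correlation evaluation, here is a minimal Python sketch. The rank values are hypothetical, and the abstract does not specify which correlation coefficient is used; Spearman's rho and Kendall's tau are shown here as common choices for comparing two rankings.

```python
# Minimal sketch of the rank-correlation evaluation described above.
# The rank lists below are hypothetical; the benchmark supplies five
# review texts per image with human-annotated rankings.
from scipy.stats import spearmanr, kendalltau

human_ranks = [1, 2, 3, 4, 5]   # annotated gold ranking (1 = best)
model_ranks = [2, 1, 3, 5, 4]   # ranking produced by the LVLM

# Correlation of 1.0 means the model's ranking matches the human ranking.
rho, _ = spearmanr(human_ranks, model_ranks)
tau, _ = kendalltau(human_ranks, model_ranks)
print(f"Spearman's rho: {rho:.3f}, Kendall's tau: {tau:.3f}")
```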
Paper Type: Short
Research Area: Resources and Evaluation
Research Area Keywords: Vision and Language, Large-scale vision language models, Multimodal, Review Texts, Evaluation Metrics
Contribution Types: Data resources
Languages Studied: English, Japanese
Submission Number: 5054