Keywords: Video Language Models, Model evaluation, Reliability
Abstract: As video language models (VLMs) are applied in an increasingly wide range of scenarios, the need for robust and scalable evaluation of their performance becomes increasingly critical. Traditional human expert-based evaluation of VLMs is limited in consistency and scalability, which has sparked interest in automatic methods such as employing VLMs to evaluate VLMs. However, the reliability of VLMs as judges remains underexplored. Existing methods often rely on a single VLM as the evaluator, an approach that can be unreliable or biased because such a model may lack the ability to fully understand the content and may carry inherent biases, ultimately compromising evaluation reliability. A remedy is to apply the principle of collective thoughts, aggregating evaluations from multiple VLMs to enhance reliability. This study investigates the efficacy of such approaches, particularly when the pool of judges includes both reliable and unreliable models. Our findings reveal that incorporating collective judgments from such a mixed pool does not necessarily improve the accuracy of the final evaluation: the inclusion of less reliable judges can introduce noise that undermines the overall reliability of the outcomes. To explore the factors that affect evaluation reliability, we fine-tune an underperforming VLM judge, Video-LLaVA, and observe that improved understanding ability alone is insufficient to make VLM judges more reliable. These findings underscore the limitations of collective-thought approaches and highlight the need for more advanced methods that account for the reliability of individual models. Our study promotes the development of more reliable evaluation methods for VLMs.
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 12195