Keywords: Video Language Models, Model evaluation, Collective thought, Reliability
Abstract: Evaluating video language models (VLMs) is crucial for improving their understanding of video content. Existing evaluation methods rely on a single model, which may be unreliable or biased owing to its limited ability to understand content or its inherent biases, ultimately compromising the reliability of the evaluation. A straightforward remedy is to apply the principle of collective thought, aggregating reviews from multiple VLMs to enhance reliability. This study investigates the efficacy of such approaches in VLM evaluation, particularly when the pool of judges includes both reliable and unreliable models. Our findings reveal that incorporating collective judgments from such a mixed pool does not necessarily improve the accuracy of the final evaluation outcomes, because less reliable judges can introduce noise that degrades the overall evaluation. To explore the factors that affect evaluation reliability, we fine-tune an underperforming VLM judge, Video-LLaVA, and observe that strong understanding ability alone is insufficient to make a VLM judge reliable. These findings underscore the limitations of collective thought approaches in VLM evaluation and highlight the need for more advanced methods that account for the reliability of individual models. Our study promotes the development of more reliable evaluation methods for VLMs.
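For illustration only, the sketch below shows one common way the collective-thought principle described in the abstract can be instantiated: collecting verdicts from several VLM judges and taking a majority vote. This is not the paper's implementation; the judge names and the `judge_fn` interface are hypothetical placeholders.

```python
# Minimal sketch of collective-thought aggregation via majority voting.
# Assumes each judge exposes a callable returning "correct" or "incorrect";
# real judges would wrap VLM inference calls.
from collections import Counter
from typing import Callable, Dict, List


def aggregate_verdicts(
    judges: Dict[str, Callable[[str, str], str]],
    video_desc: str,
    candidate_answer: str,
) -> str:
    """Collect a verdict from every judge and return the majority label."""
    verdicts: List[str] = [
        judge_fn(video_desc, candidate_answer) for judge_fn in judges.values()
    ]
    label, _count = Counter(verdicts).most_common(1)[0]
    return label


if __name__ == "__main__":
    # Dummy judges for demonstration; "judge_b" plays the role of an
    # unreliable judge that injects noise into the pool.
    dummy_judges = {
        "judge_a": lambda video, answer: "correct",
        "judge_b": lambda video, answer: "incorrect",
        "judge_c": lambda video, answer: "correct",
    }
    print(aggregate_verdicts(dummy_judges, "a cat jumps onto a table", "The cat sits still."))
```

As the abstract notes, such naive aggregation does not guarantee more reliable outcomes when unreliable judges are in the pool, since their noisy votes can outweigh or dilute the reliable ones.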
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 12195