Abstract: The proliferation of Vision-Language Models (VLMs) in the past several years calls for rigorous and comprehensive evaluation methods and benchmarks. This work analyzes existing VLM evaluation techniques, including automated metrics, AI-based assessments, and human evaluations across diverse tasks. We first introduce Robin - a novel suite of VLMs that we built by combining Large Language Models (LLMs) and Vision Encoders (VEs) at multiple scales - and use Robin to identify shortcomings of current evaluation approaches across those scales. Next, to overcome the identified limitations, we introduce CHIRP - a new long-form response benchmark we developed for more robust and complete VLM evaluation. We provide open access to the Robin training code, model suite, and CHIRP benchmark to promote reproducibility and advance VLM research.
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: Dear Reviewers,
Thank you for your patience; we have taken all of your feedback into account. In particular, we have addressed the main concern about the confusing narrative and have significantly restructured the paper to clarify our main contribution.
We believe that this version comprehensively addresses your concerns, but please let us know if you find any remaining issues. We very much appreciate your feedback.
Assigned Action Editor: ~Liang-Chieh_Chen1
Submission Number: 4099