Track: Regular papers (within 8 pages excluding appendix)
Keywords: Vision-Language Models, Model Evaluation, Robustness
TL;DR: Standard open-ended VLM validation is slow and unreliable; we show that converting it to a multiple-choice format is >10x faster and yields a stable metric that strongly predicts final performance.
Abstract: Reliable and efficient validation is critical for guiding the resource-intensive process of training Vision-Language Models (VLMs). However, the standard evaluation paradigm, which relies on open-ended text generation, exhibits significant methodological limitations. We empirically demonstrate that this approach is unreliable, yielding high-variance metrics with a negligible correlation (r = 0.061) to final model performance. It is also inefficient, as auto-regressive decoding introduces substantial latency and severe load-balancing issues in parallel evaluation. To address these limitations, we propose "Closed-Task" validation, a paradigm that bypasses auto-regressive decoding by converting questions into a multiple-choice format and directly inspecting token probabilities. Our experiments show this method is both reliable, producing stable signals strongly correlated (r = 0.798) with final performance, and efficient, achieving a >10x latency reduction with near-perfect load balancing. This work thus provides a validation methodology that jointly addresses evaluation reliability and system efficiency, offering a stronger empirical framework for VLM development.
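To make the proposed paradigm concrete, the sketch below illustrates one way a "Closed-Task" check could be implemented: the question is rendered as a multiple-choice prompt, a single forward pass produces next-token logits, and the option letters are compared directly, with no auto-regressive decoding. This is a minimal illustration, not the paper's implementation; the checkpoint name, auto classes, and prompt template are placeholder assumptions for a Hugging Face-style, decoder-only VLM.

```python
# Hypothetical sketch of "Closed-Task" validation: score answer options by their
# next-token probabilities from one forward pass, instead of generating free text.
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

MODEL_NAME = "org/vlm-checkpoint"  # placeholder; the exact auto class depends on the model family
processor = AutoProcessor.from_pretrained(MODEL_NAME)
model = AutoModelForVision2Seq.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
model.eval()


def closed_task_correct(image: Image.Image, question: str, options: list[str], gold_idx: int) -> bool:
    """Return True if the option letter with the highest next-token logit matches the gold answer."""
    letters = ["A", "B", "C", "D"][: len(options)]
    choices = "\n".join(f"{l}. {o}" for l, o in zip(letters, options))
    prompt = f"{question}\n{choices}\nAnswer with the letter only:"

    inputs = processor(images=image, text=prompt, return_tensors="pt")
    with torch.no_grad():
        # Single forward pass; take logits at the last prompt position (next-token distribution).
        next_token_logits = model(**inputs).logits[0, -1]

    # Note: depending on the tokenizer, " A" (with a leading space) may be the more natural target.
    letter_ids = [processor.tokenizer.encode(l, add_special_tokens=False)[0] for l in letters]
    pred_idx = int(torch.stack([next_token_logits[i] for i in letter_ids]).argmax())
    return pred_idx == gold_idx
```

Because each example costs exactly one forward pass of fixed prompt length, per-example latency is uniform, which is what enables the near-perfect load balancing reported in the abstract.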
Confirmation: I have read and agree with the workshop's policy on behalf of myself and my co-authors.
Submission Number: 7