Track: Regular papers (within 8 pages excluding appendix)
Keywords: Vision-Language Models, Model Evaluation, Robustness
TL;DR: Standard open-ended VLM validation is slow and unreliable; we show that converting it to a multiple-choice format is >10x faster and yields a stable metric that strongly predicts final performance.
Abstract: Reliable and efficient validation is critical for guiding the resource-intensive process of training Vision-Language Models (VLMs). However, the standard evaluation paradigm, which relies on open-ended text generation, exhibits significant methodological limitations. We empirically demonstrate that this approach is unreliable, yielding high-variance metrics with a negligible correlation (r = 0.061) to final model performance. It is also inefficient, as auto-regressive decoding introduces substantial latency and severe load-balancing issues in parallel evaluation. To address these limitations, we propose "Closed-Task" validation, a paradigm that bypasses auto-regressive decoding by converting questions into a multiple-choice format and directly inspecting token probabilities. Our experiments show this method is both reliable, producing stable signals strongly correlated (r = 0.798) with final performance, and efficient, achieving a >10x latency reduction with near-perfect load balancing. This work thus provides a validation methodology that jointly addresses evaluation reliability and system efficiency, offering a stronger empirical framework for VLM development.
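To make the proposed paradigm concrete, the sketch below illustrates one way a "Closed-Task" check could be implemented: the question is rendered as a multiple-choice prompt, a single forward pass produces next-token logits, and the option letters are compared directly, with no auto-regressive decoding. This is a minimal illustration, not the paper's implementation; the checkpoint name, auto classes, and prompt template are placeholder assumptions for a Hugging Face-style, decoder-only VLM.

```python
# Hypothetical sketch of "Closed-Task" validation: score answer options by their
# next-token probabilities from one forward pass, instead of generating free text.
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

MODEL_NAME = "org/vlm-checkpoint"  # placeholder; the exact auto class depends on the model family
processor = AutoProcessor.from_pretrained(MODEL_NAME)
model = AutoModelForVision2Seq.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
model.eval()


def closed_task_correct(image: Image.Image, question: str, options: list[str], gold_idx: int) -> bool:
    """Return True if the option letter with the highest next-token logit matches the gold answer."""
    letters = ["A", "B", "C", "D"][: len(options)]
    choices = "\n".join(f"{l}. {o}" for l, o in zip(letters, options))
    prompt = f"{question}\n{choices}\nAnswer with the letter only:"

    inputs = processor(images=image, text=prompt, return_tensors="pt")
    with torch.no_grad():
        # Single forward pass; take logits at the last prompt position (next-token distribution).
        next_token_logits = model(**inputs).logits[0, -1]

    # Note: depending on the tokenizer, " A" (with a leading space) may be the more natural target.
    letter_ids = [processor.tokenizer.encode(l, add_special_tokens=False)[0] for l in letters]
    pred_idx = int(torch.stack([next_token_logits[i] for i in letter_ids]).argmax())
    return pred_idx == gold_idx
```

Because each example costs exactly one forward pass of fixed prompt length, per-example latency is uniform, which is what enables the near-perfect load balancing reported in the abstract.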
Confirmation: I have read and agree with the workshop's policy on behalf of myself and my co-authors.
Submission Number: 7