Keywords: LLM, VLM, Benchmarking, Dataset, Instruction Following
Abstract: Recent advances in vision-language models (VLMs) have improved their ability to perform multimodal reasoning. However, their capacity to consistently follow answer-specification instructions (explicit directives about how responses should be formatted, structured, or composed) remains largely unexplored. This ability is critical for improving user experience and for enabling fair, reliable comparisons across models. To evaluate it, we introduce OLIIV (Open Language-Image Input Variation), a benchmark that measures compliance with answer-specification instructions across four representative task types: multiple-choice reasoning, binary question answering, structured output generation in JSON, YAML, and XML, and length-constrained image captioning. Each task is tested under systematically varied prompt formulations to assess whether models maintain instruction-following compliance when the input phrasing changes but the underlying task remains fixed. Results show that many models perform inconsistently across superficially different but semantically equivalent prompts: models often behave differently when multiple-choice options are labeled with Roman numerals rather than letters, and produce more compliant Yes/No answers than True/False ones, despite identical instructions. To evaluate adherence to length constraints, we introduce the Length Infidelity Score (LIS), a deterministic, model-agnostic metric that quantifies over- or under-length responses. Structured-output evaluation further shows that models frequently produce syntactically correct but structurally invalid outputs, such as inserting empty fields or omitting required schema elements. Taken together, our findings reveal that all VLMs evaluated in our experiments are highly sensitive to prompt variation.
Such sensitivity limits the fairness of current benchmarking methods; OLIIV fills this gap by providing a structured framework for explicitly testing how robust VLMs are to semantically equivalent prompt variations.
Primary Area: datasets and benchmarks
Submission Number: 13474