Keywords: Multimodal LLMs, Spatial Reasoning, Evaluation
TL;DR: We diagnose and correct the influence of task-agnostic factors on VSI-Bench scores via FN-VSI, a factor-normalized score that ranks VLMs differently from raw aggregate accuracy.
Abstract: We introduce FN-VSI, a factor-normalized score for VSI-Bench based on an information-theoretic diagnostic, which substantially re-ranks vision-language models (VLMs). Spatial reasoning benchmarks for VLMs are typically reported as aggregate task scores, but such scores are not determined by model ability alone. VSI-Bench performance is strongly affected by task-agnostic factors such as scene source, object category, ground-truth answer value, and low-level visual properties of the input video, none of which is an intended target of evaluation. Because these factors are imbalanced and mutually entangled, raw score differences conflate spatial reasoning ability with benchmark composition. To disentangle their effects, we introduce FST, a diagnostic that estimates each factor's contribution by comparing its marginal and conditional mutual information with model scores, and classifies each factor as a direct contributor, a surface correlate, a suppressed effect, or a negligible factor. Across VSI-Bench tasks and multiple VLMs, ground-truth answer value and queried object type emerge as strong direct contributors, while several apparent sensitivities disappear after adjustment. Building on this diagnosis, FN-VSI reweights benchmark instances to neutralize the genuine contributors identified by FST. The resulting re-ranking indicates that reported spatial reasoning gains can depend on the factor mixture of the benchmark rather than on genuine ability.
Paper Type: Long (8 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 146
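
The abstract describes FST as comparing each factor's marginal and conditional mutual information with model scores. The following is a minimal sketch of that idea for discrete factors, not the authors' implementation. The plug-in estimators, the function names, the threshold `eps`, and the mapping from the (marginal, conditional) pair to the four labels are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from math import log2


def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in bits for two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    return sum(
        (c / n) * log2(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )


def conditional_mutual_information(xs, ys, zs):
    """Plug-in estimate of I(X; Y | Z): average of I(X; Y) within each Z group."""
    n = len(xs)
    groups = defaultdict(list)
    for x, y, z in zip(xs, ys, zs):
        groups[z].append((x, y))
    total = 0.0
    for pairs in groups.values():
        gx, gy = zip(*pairs)
        total += (len(pairs) / n) * mutual_information(gx, gy)
    return total


def classify_factor(marginal, conditional, eps=0.05):
    """Hypothetical labelling rule; eps is an illustrative threshold in bits."""
    if marginal >= eps and conditional >= eps:
        return "direct contributor"
    if marginal >= eps:
        return "surface correlate"   # association vanishes after conditioning
    if conditional >= eps:
        return "suppressed effect"   # association appears only after conditioning
    return "negligible"


def diagnose(factor, other, scores, eps=0.05):
    """Classify `factor` by its MI with `scores`, conditioning on `other`."""
    marginal = mutual_information(factor, scores)
    conditional = conditional_mutual_information(factor, scores, other)
    return classify_factor(marginal, conditional, eps)


# Toy data: the score equals factor a; factor b merely correlates with a.
a = [0, 0, 0, 0, 1, 1, 1, 1]
b = [0, 0, 0, 1, 1, 1, 1, 0]
score = list(a)
label_a = diagnose(a, b, score)
label_b = diagnose(b, a, score)
```

In this toy setup, `a` keeps its association with the score after conditioning on `b`, whereas the association of `b` with the score vanishes once `a` is held fixed. With real benchmark data, continuous factors and scores would need binning or a different estimator, and the conditioning set would typically include several factors rather than one.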