Keywords: Vision-Language Models; World Models; Counterfactual Evaluation; Judge Stability
TL;DR: We introduce GridWM-Judge, a diagnostic benchmark revealing that VLM judges' robustness failures stem from world-model deficits and that atomic prediction accuracy is decoupled from judgment stability.
Abstract: Vision–language models (VLMs) are increasingly used as automated judges to score agent trajectories, yet their success/fail verdicts are brittle to benign changes in wording, evidence order, or rendering. This instability may reflect a deeper world-model deficit: current VLM judges often lack reliable action-conditioned state tracking. We introduce GridWM-Judge, a diagnostic benchmark built on six deterministic MiniGrid environments that generates physically consistent Full/NoCue/Counterfactual trajectory triplets via deterministic planning and minimal interventions. It decomposes evaluation into three tasks: (A) 4-choice atomic next-observation prediction, (B) structured scene perception as canonical JSON, and (C) success/fail judging from rendered storyboards. We quantify reliability via Judgment Consistency Rate and Flip Rate under controlled framing, temporal, and visual-attribute probes, alongside accuracy and a correlation analysis linking atomic prediction to judgment stability. Experiments across 13 VLMs reveal a fragility paradox: higher atomic transition accuracy does not necessarily yield more stable judgments and can even correlate negatively with stability under temporal and visual probes. This reflects a decoupling of accuracy from robustness: weaker models fall back on near-constant default verdicts, while stronger models engage in sensitive state tracking that is brittle to non-semantic perturbations. Robustness failures cannot be captured by accuracy alone; GridWM-Judge diagnoses how world-model competence relates to judge reliability.
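The abstract names Judgment Consistency Rate and Flip Rate as the reliability metrics. Below is a minimal Python sketch of how such metrics could be computed, assuming Flip Rate is the fraction of (trajectory, probe) pairs whose verdict differs from the unperturbed verdict and JCR is the fraction of trajectories judged identically under every probe; the paper's exact definitions may differ, and all function names and verdict labels here are illustrative.

```python
# Hedged sketch of the two reliability metrics from the abstract.
# Assumption: each trajectory has one base verdict and a list of verdicts
# obtained under framing/temporal/visual-attribute probes.
from typing import List


def flip_rate(base: List[str], probed: List[List[str]]) -> float:
    """Fraction of (trajectory, probe) pairs where the verdict flips."""
    flips = total = 0
    for b, variants in zip(base, probed):
        for v in variants:
            flips += (v != b)
            total += 1
    return flips / total if total else 0.0


def judgment_consistency_rate(base: List[str], probed: List[List[str]]) -> float:
    """Fraction of trajectories judged identically under every probe."""
    consistent = sum(all(v == b for v in variants)
                     for b, variants in zip(base, probed))
    return consistent / len(base) if base else 0.0


# Toy example: two trajectories, each judged under three probes.
base = ["success", "fail"]
probed = [["success", "success", "fail"], ["fail", "fail", "fail"]]
print(flip_rate(base, probed))                  # 1/6 ≈ 0.167
print(judgment_consistency_rate(base, probed))  # 1/2 = 0.5
```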
Project repository: https://github.com/Lucas-Jin-Qh/GridWM_Judge
Supplementary Material: zip
Submission Number: 49