Keywords: Multimodal Large Language Models, Verifiers, Digital Agents, Web Agents, GUI Agents, Robotics, Reward Models
Abstract: Verifiers --- functions assigning rewards to agent behavior --- have been key for AI progress in domains such as math and board games. However, extending these gains to domains without clear-cut success criteria (e.g., computer use) remains a challenge: while humans can recognize suitable outcomes, translating this intuition into scalable rules is nontrivial.
Multimodal Large Language Models (MLLMs) emerge as a promising solution, given their world knowledge, human-preference alignment, and reasoning skills.
We evaluate MLLMs as verifiers of agent trajectories across web navigation, computer use, and robotic manipulation, and identify a critical limitation: agreement bias, a strong tendency for MLLMs to favor information in their context window, often generating chains of thought to rationalize flawed behavior.
This bias is pervasive across models, resilient to test-time scaling, and can impact existing methods relying on MLLMs as evaluators (e.g., data filtering).
Notably, it occurs despite MLLMs showing strong, human-aligned priors on desired behavior.
To address this, we propose Self-Grounded Verification (SGV), a lightweight method that enables more effective use of MLLMs' knowledge and reasoning by harnessing their own sampling mechanisms via unconditional and conditional generation.
SGV operates in two steps: first, the MLLM is prompted to retrieve broad priors about task completion, independent of the data under evaluation.
Then, conditioned on self-generated priors, it reasons over and evaluates a candidate trajectory.
Enhanced with SGV, MLLM verifiers show gains of up to 20 percentage points in accuracy and failure detection rates, and can perform real-time supervision of heterogeneous agents, boosting task completion of a GUI specialist in OSWorld, a diffusion policy in robomimic, and a ReAct agent in VisualWebArena
--- setting a new state of the art on the benchmark, surpassing the previous best by 48\%. Finally, we release an updated version of the (Visual)WebArena benchmark featuring more human-aligned evaluators, improved environment parallelization with higher execution fidelity, and a runtime speedup of over 10x. Our code, data, and additional visualizations are available at https://self-grounded-verification.github.io
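The two-step procedure described in the abstract can be summarized as a simple two-call prompting pattern. The sketch below is illustrative only: the `generate` interface, prompt wording, and text-only trajectory serialization are assumptions, not the authors' released implementation (real trajectories would typically include screenshots passed through a multimodal API).

```python
# Minimal sketch of the two-step SGV pattern: (1) elicit priors about task
# completion unconditionally, (2) evaluate the trajectory conditioned on them.
# All names and prompts here are hypothetical placeholders.
from typing import Callable, Tuple

def self_grounded_verification(
    generate: Callable[[str], str],  # hypothetical MLLM call: prompt text -> completion text
    task: str,
    trajectory: str,                 # candidate trajectory serialized as text for this sketch
) -> Tuple[bool, str, str]:
    # Step 1 (unconditional generation): retrieve broad priors about what
    # successful completion looks like, without showing the candidate trajectory.
    prior_prompt = (
        f"Task: {task}\n"
        "Before seeing any attempt, describe what a successful completion of this "
        "task should look like: expected final state, key intermediate steps, and "
        "common failure modes."
    )
    priors = generate(prior_prompt)

    # Step 2 (conditional generation): condition on the self-generated priors and
    # reason over the candidate trajectory to produce a verdict.
    verify_prompt = (
        f"Task: {task}\n"
        f"Your previously stated success criteria:\n{priors}\n\n"
        f"Candidate trajectory:\n{trajectory}\n\n"
        "Check the trajectory against your criteria and answer with SUCCESS or "
        "FAILURE, followed by a brief justification."
    )
    verdict = generate(verify_prompt)

    return verdict.strip().upper().startswith("SUCCESS"), priors, verdict
```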
Supplementary Material: zip
Submission Number: 231