Mitigating Perceptual Judgment Bias in Multimodal LLM-as-a-Judge for Mathematical and Scientific Reasoning Evaluation
Code Url: https://perception-judge.github.io/
Keywords: MLLM-as-a-Judge, Judgment Bias
TL;DR: We identify and mitigate Perceptual Judgment Bias in multimodal LLM judges using a perception-guided training framework with structured batch reward modeling, significantly improving perceptual fidelity and human alignment.
Abstract: Multimodal large language models (MLLMs) are increasingly used to evaluate reasoning over mathematical diagrams, scientific figures, and charts, where reliable visual grounding is essential for verifiable assessment.
Yet when visual evidence conflicts with textual cues, MLLM judges tend to reward plausible narratives over perceptually correct answers, a critical failure in evaluating mathematical and scientific reasoning.
We identify this phenomenon as Perceptual Judgment Bias: judges anchor on response text rather than their own visual perception, producing inconsistent and non-verifiable evaluations.
To address this, we introduce the Perceptually Perturbed Judgment Dataset, which constructs minimally edited counterfactual responses that isolate perceptual errors and enable verifiable supervision.
Building on this dataset, we train judges with Group Relative Policy Optimization (GRPO) using a verifiable batch-ranking reward, which yields a coherent global ordering without explicit pairwise labels.
On MLLM-as-a-Judge benchmarks including MathVista, ScienceQA, ChartQA, InfographicVQA and MM-Vet, our method substantially improves perceptual fidelity, ranking coherence, and alignment with human evaluation, with the largest gains on mathematical and scientific reasoning tasks.
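As a rough illustration of the batch-ranking idea mentioned in the abstract, the sketch below scores one judge rollout by how well its ordering of a batch agrees with a known reference ordering, then standardizes rewards within a group of rollouts, as GRPO does. The abstract does not specify the reward, so everything here is an assumption made for illustration: the function names `batch_ranking_reward` and `group_relative_advantages`, the pairwise-concordance form of the reward, and the encoding of reference quality as integer ranks (0 for the unedited response, larger values for more heavily perturbed copies).

```python
from itertools import combinations
from statistics import mean, pstdev


def batch_ranking_reward(judge_scores, reference_ranks):
    """Fraction of comparable response pairs the judge orders like the reference.

    judge_scores: score the judge gave each response in the batch (higher = better).
    reference_ranks: assumed known quality rank per response (lower = better),
        e.g. 0 for the original and larger values for perturbed copies.
    """
    concordant, comparable = 0, 0
    for i, j in combinations(range(len(judge_scores)), 2):
        if reference_ranks[i] == reference_ranks[j]:
            continue  # no verifiable preference between these two responses
        comparable += 1
        better, worse = (i, j) if reference_ranks[i] < reference_ranks[j] else (j, i)
        if judge_scores[better] > judge_scores[worse]:
            concordant += 1
    return concordant / comparable if comparable else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical batch: original response, one perceptual edit, two perceptual edits.
reference_ranks = [0, 1, 2]
# Three sampled judge rollouts, each scoring the same batch.
rollouts = [[9, 6, 3], [9, 3, 6], [3, 6, 9]]
rewards = [batch_ranking_reward(scores, reference_ranks) for scores in rollouts]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

In this sketch the rollout that ranks the whole batch correctly gets the highest advantage and the fully reversed one gets the lowest. The reference order comes from how the responses were constructed, so the supervision signal is a single ordering over the batch and needs no separately annotated pairwise preferences.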
Submission Number: 205