Keywords: LLM-as-a-Judge, Vision-Language Alignment, Multilingual Evaluations, LLM Evaluations
TL;DR: LLM judge performance depends on complex interactions between language, task type, and model characteristics, not just individual capabilities.
Abstract: Large Language Models (LLMs) as judges have emerged as an important component of the post-training pipeline. The growing popularity of judge LLMs has prompted their evaluation on proxy alignment and reward modelling datasets, yet they have not been assessed in combined cross-lingual and multimodal settings. To address this gap, we introduce PolyVis (Polyglot Vision-Language Alignment), a multilingual vision-language alignment benchmark that evaluates judge models across 12 languages and four distinct task objectives: hallucinations, safety, knowledge, and reasoning. Our findings reveal that LLM judge performance is significantly influenced by composite interactions between task objectives, language, and individual model characteristics. These results suggest the need for tailored evaluation frameworks that challenge each model’s specific capabilities, moving beyond one-size-fits-all approaches that obscure critical performance disparities.
Submission Number: 227