Keywords: LLM-as-a-Judge, Vision-Language Alignment, Multilingual Evaluations, LLM Evaluations
TL;DR: LLM judge performance depends on complex interactions between language, task type, and model characteristics, not just individual capabilities.
Abstract: Large Language Models (LLMs) as judges have emerged as an important component of the post-training pipeline. The growing popularity of judge LLMs has prompted their evaluation on proxy alignment and reward modelling datasets, yet they have not been assessed in combined cross-lingual and multimodal settings. To address this gap, we introduce PolyVis (Polyglot Vision-Language Alignment), a multilingual vision-language alignment benchmark that evaluates judge models across 12 languages and four distinct task objectives: hallucinations, safety, knowledge, and reasoning. Our findings reveal that LLM judge performance is significantly influenced by composite interactions between task objectives, language, and individual model characteristics. These results suggest the need for tailored evaluation frameworks that challenge each model’s specific capabilities, moving beyond one-size-fits-all approaches that obscure critical performance disparities.
Submission Number: 227