M-Prometheus: A Suite of Open Multilingual LLM Judges

Published: 08 Jul 2025 · Last Modified: 26 Aug 2025 · COLM 2025 · CC BY 4.0
Keywords: automatic evaluation, llm-as-a-judge, multilinguality
TL;DR: We introduce M-Prometheus, a suite of open-weight multilingual LLM judges ranging from 3B to 14B parameters. M-Prometheus models outperform state-of-the-art open LLM judges.
Abstract: Employing language models as evaluators of long-form output (LLM-as-a-Judge) has become the de facto standard for automatic evaluation. However, most LLM judges have been optimized exclusively for English outputs, with strategies for enhancing judges' multilingual evaluation capabilities remaining largely unexplored in the current literature. This has created a disparity in the quality of automatic evaluation methods for other languages, ultimately hindering the development of models with better multilingual capabilities. To bridge this gap, we introduce M-Prometheus, a suite of open-weight LLM judges ranging from 3B to 14B parameters that can provide both direct assessment and pairwise comparison feedback on multilingual outputs. M-Prometheus models outperform state-of-the-art open LLM judges on multilingual reward benchmarks spanning more than 20 languages, as well as on literary machine translation evaluation covering 4 language pairs. Furthermore, we find that M-Prometheus models can be used with quality-aware decoding methods to significantly improve generated outputs, showcasing their utility for the development of better multilingual models. Crucially, through extensive ablations, we identify key strategies for training an effective multilingual judge. Our findings highlight the significance of model size and base model selection, and the advantages of using natively multilingual data rather than translated data. We release our models, training dataset, and code to reproduce our experiments.
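The abstract describes two judging modes, direct assessment and pairwise comparison. As a minimal illustrative sketch of what those two modes look like in practice, the functions below build judge prompts for each mode. The wording, score scale, and function names are assumptions for illustration only, not the paper's actual prompt templates.

```python
# Hypothetical prompt builders for the two LLM-as-a-Judge modes mentioned
# in the abstract. The exact wording and 1-5 scale are assumptions, not
# the M-Prometheus templates.

def direct_assessment_prompt(instruction: str, response: str, rubric: str) -> str:
    """Ask the judge to give feedback and a 1-5 score for a single response."""
    return (
        "You are an impartial evaluator. Given the instruction, the response, "
        "and the scoring rubric, write feedback and then a score from 1 to 5.\n\n"
        f"Instruction: {instruction}\n"
        f"Response: {response}\n"
        f"Rubric: {rubric}\n"
        "Feedback:"
    )


def pairwise_comparison_prompt(instruction: str, response_a: str, response_b: str) -> str:
    """Ask the judge to compare two responses and pick the better one."""
    return (
        "You are an impartial evaluator. Compare the two responses to the "
        "instruction and state which is better (A or B), with a brief "
        "justification.\n\n"
        f"Instruction: {instruction}\n"
        f"Response A: {response_a}\n"
        f"Response B: {response_b}\n"
        "Verdict:"
    )
```

In either mode, the resulting prompt would be sent to the judge model and its generated feedback, score, or verdict parsed from the completion.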
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html
Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html
Submission Number: 1633