Eye of Judgement: Dissecting the Evaluation of Russian-speaking LLMs with POLLUX

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: benchmarking, automatic creation and evaluation of language resources, NLP datasets, metrics
TL;DR: POLLUX: a new, comprehensive open-source benchmark and a family of judge models designed to evaluate the generative capabilities of LLMs in Russian
Abstract: Evaluating open-ended generation remains a difficult problem: responses vary in style, quality, and correctness, which makes reliable assessment hard to automate. To address this, we introduce POLLUX, an open-source framework for evaluating Russian-speaking large language models (LLMs). Its novelty lies in a criteria-based methodology that improves interpretability by combining a structured benchmark with a family of LLM-as-a-Judge evaluators. For each task type, we define explicit criteria and a scoring protocol in which models not only rate responses but also justify their judgments, offering a transparent alternative to resource-intensive human comparisons. The benchmark spans 35 task types across domains such as code generation, creative writing, and assistant-style interactions, supported by 2,115 expert-authored prompts stratified by difficulty. In addition, we release specialized judge models (7B and 32B parameters) trained for fine-grained assessment of generative outputs. By uniting a comprehensive task taxonomy with automated judges, POLLUX provides scalable, interpretable evaluation tools that move beyond the cost and inconsistency of human annotation.
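
Below is a minimal sketch, in Python, of how the criteria-based judging protocol described in the abstract might look in practice. It is not the authors' code: the criterion rubric, prompt wording, JSON output format, and the `call_judge` stub (standing in for inference with one of the released 7B/32B judges) are all illustrative assumptions.

```python
import json
from dataclasses import dataclass


@dataclass
class Criterion:
    """One explicit evaluation criterion with a bounded integer scale."""
    name: str
    description: str
    scale: tuple  # e.g. (0, 5)


def build_judge_prompt(task: str, response: str, criterion: Criterion) -> str:
    """Assemble a single-criterion judging prompt (format is an assumption)."""
    return (
        f"Task: {task}\n"
        f"Model response: {response}\n"
        f"Criterion: {criterion.name}. {criterion.description}\n"
        f"Rate the response on a {criterion.scale[0]}-{criterion.scale[1]} scale and "
        'return JSON: {"score": <int>, "justification": <str>}'
    )


def call_judge(prompt: str) -> str:
    # Placeholder for the actual judge model call; in the real framework this
    # would be inference with a trained evaluator, not a canned answer.
    return json.dumps({
        "score": 4,
        "justification": "Fluent and factually correct; minor stylistic issues.",
    })


if __name__ == "__main__":
    criterion = Criterion(
        name="Factual accuracy",
        description="Claims in the response are verifiable and correct.",
        scale=(0, 5),
    )
    prompt = build_judge_prompt(
        task="Summarize the news article in three sentences.",
        response="The article reports a 12% rise in regional exports last quarter.",
        criterion=criterion,
    )
    verdict = json.loads(call_judge(prompt))
    print(verdict["score"], "-", verdict["justification"])
```

The point of the sketch is the shape of the protocol: each (task type, criterion) pair yields a separate score plus a textual justification, which is what makes the evaluation interpretable rather than a single opaque preference judgment.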
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 10840