Eye of Judgement: Dissecting the Evaluation of Russian-speaking LLMs with POLLUX

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: benchmarking, automatic creation and evaluation of language resources, NLP datasets, metrics
TL;DR: POLLUX: a new, comprehensive open-source benchmark and a family of judge models designed to evaluate the generative capabilities of LLMs in Russian
Abstract: Evaluating open-ended generation remains a difficult problem: responses vary in style, quality, and correctness, which makes reliable assessment hard to automate. To address this, we introduce POLLUX, an open-source framework for evaluating Russian-speaking large language models (LLMs). Its novelty lies in a criteria-based methodology that improves interpretability by combining a structured benchmark with a family of LLM-as-a-Judge evaluators. For each task type, we define explicit criteria and a scoring protocol in which models not only rate responses but also justify their judgments, offering a transparent alternative to resource-intensive human comparisons. The benchmark spans 35 task types across domains such as code generation, creative writing, and assistant-style interactions, supported by 2,115 expert-authored prompts stratified by difficulty. In addition, we release specialized judge models (7B and 32B parameters) trained for fine-grained assessment of generative outputs. By uniting a comprehensive task taxonomy with automated judges, POLLUX provides scalable, interpretable evaluation tools that move beyond the cost and inconsistency of human annotation.
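
Below is a minimal sketch, in Python, of how the criteria-based judging protocol described in the abstract might look in practice. It is not the authors' code: the criterion rubric, prompt wording, JSON output format, and the `call_judge` stub (standing in for inference with one of the released 7B/32B judges) are all illustrative assumptions.

```python
import json
from dataclasses import dataclass


@dataclass
class Criterion:
    """One explicit evaluation criterion with a bounded integer scale."""
    name: str
    description: str
    scale: tuple  # e.g. (0, 5)


def build_judge_prompt(task: str, response: str, criterion: Criterion) -> str:
    """Assemble a single-criterion judging prompt (format is an assumption)."""
    return (
        f"Task: {task}\n"
        f"Model response: {response}\n"
        f"Criterion: {criterion.name}. {criterion.description}\n"
        f"Rate the response on a {criterion.scale[0]}-{criterion.scale[1]} scale and "
        'return JSON: {"score": <int>, "justification": <str>}'
    )


def call_judge(prompt: str) -> str:
    # Placeholder for the actual judge model call; in the real framework this
    # would be inference with a trained evaluator, not a canned answer.
    return json.dumps({
        "score": 4,
        "justification": "Fluent and factually correct; minor stylistic issues.",
    })


if __name__ == "__main__":
    criterion = Criterion(
        name="Factual accuracy",
        description="Claims in the response are verifiable and correct.",
        scale=(0, 5),
    )
    prompt = build_judge_prompt(
        task="Summarize the news article in three sentences.",
        response="The article reports a 12% rise in regional exports last quarter.",
        criterion=criterion,
    )
    verdict = json.loads(call_judge(prompt))
    print(verdict["score"], "-", verdict["justification"])
```

The point of the sketch is the shape of the protocol: each (task type, criterion) pair yields a separate score plus a textual justification, which is what makes the evaluation interpretable rather than a single opaque preference judgment.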
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 10840