Keywords: Machine Translation, Evaluation, LLM-as-a-judge
Abstract: Developing reliable machine translation (MT) systems hinges on our ability to distinguish superior translations from inferior ones—but existing evaluation paradigms, whether limited to coarse overall rankings or misaligned with human preferences, fail to deliver interpretable, fine‑grained feedback in reference‑free settings. We present a Fine-Grained Ranking Evaluation method (FiRE) that leverages off‑the‑shelf large language models to perform criterion‑driven pairwise comparison across three complementary dimensions—faithfulness, fluency, and consistency of style—rather than producing a single holistic judgment. To enable rigorous meta‑evaluation of evaluation paradigms in the absence of any suitable testbed, we construct the first human‑annotated, reference‑free benchmark for fine-grained ranking evaluation, achieving substantial inter‑annotator agreement. Through meta‑evaluation on this benchmark, FiRE demonstrably outperforms leading regression‑based and error‑analysis metrics in aligning with human comparative judgments, while providing more informative insights into translation quality. Finally, our examination of LLM evaluator biases (position and self-enhancement) and their handling of tied cases offers guidance for more nuanced MT evaluation.
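For illustration only, the sketch below shows what criterion-driven pairwise LLM judging with a simple position-bias check could look like in practice. The prompt wording, the model name (`gpt-4o`), and the tie-handling rule are assumptions made for this sketch; they are not the prompts or protocol used by FiRE.

```python
# Hypothetical sketch of criterion-driven pairwise LLM judging.
# Assumes the `openai` Python client (v1+) is installed and OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()

# Three complementary dimensions, each with an illustrative judging question.
CRITERIA = {
    "faithfulness": "Which translation preserves the source meaning more accurately?",
    "fluency": "Which translation reads more naturally in the target language?",
    "style_consistency": "Which translation better matches the style and register of the source?",
}


def judge_pair(source: str, trans_a: str, trans_b: str, criterion: str,
               model: str = "gpt-4o") -> str:
    """Compare two translations on a single criterion; return "A", "B", or "TIE"."""
    prompt = (
        f"Source text:\n{source}\n\n"
        f"Translation A:\n{trans_a}\n\n"
        f"Translation B:\n{trans_b}\n\n"
        f"Criterion ({criterion}): {CRITERIA[criterion]}\n"
        "Answer with exactly one token: A, B, or TIE."
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    answer = resp.choices[0].message.content.strip().upper()
    return answer if answer in {"A", "B", "TIE"} else "TIE"


def judge_debiased(source: str, trans_a: str, trans_b: str, criterion: str) -> str:
    """Run the comparison in both presentation orders to probe position bias.

    If the two orderings disagree, fall back to TIE (one possible convention;
    not necessarily the paper's).
    """
    first = judge_pair(source, trans_a, trans_b, criterion)
    second = judge_pair(source, trans_b, trans_a, criterion)  # swapped order
    second_mapped = {"A": "B", "B": "A", "TIE": "TIE"}[second]
    return first if first == second_mapped else "TIE"


if __name__ == "__main__":
    verdicts = {
        c: judge_debiased("Bonjour le monde.", "Hello world.", "Hi, the world.", c)
        for c in CRITERIA
    }
    print(verdicts)  # per-criterion verdicts rather than a single holistic score
```

The point of the sketch is the structure: one pairwise query per criterion yields interpretable, dimension-level verdicts instead of a single holistic judgment, and swapping the presentation order gives a cheap check on position bias.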
Primary Area: datasets and benchmarks
Submission Number: 24011