Abstract: Recent machine translation (MT) agents rely on large language models (LLMs) as judges, typically using coarse prompts that jointly assess all types of error.
We propose a framework that decomposes translation evaluation across multiple expert evaluators, each specializing in a specific error dimension and equipped with detailed criteria, examples, and external knowledge.
Based on the resulting dimension-specific feedback, we test and analyze translation refinement under sequential, parallel, and comprehensive strategies.
Experiments with both small and large models show that combining specialized evaluators outperforms a single holistic judge by capturing fine-grained errors more effectively. Our findings highlight the benefits of decomposing complex evaluation for more effective LLM self-refinement.
Furthermore, by using smaller, open-source LLMs, our approach achieves strong performance with significantly reduced computational cost, making robust translation evaluation more accessible. This work opens new avenues for scalable, modular quality control in automated translation systems.
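The sketch below illustrates, in schematic form, the decomposition described in the abstract: dimension-specific evaluators produce feedback that is then consumed by sequential, parallel, or comprehensive refinement. It is a minimal sketch, not the authors' released code; the dimension names, prompt wording, the `call_llm` placeholder, and the exact semantics of each strategy are illustrative assumptions.

```python
# Minimal sketch (assumed interfaces, not the paper's implementation) of
# dimension-specific evaluation followed by three refinement strategies.

from dataclasses import dataclass
from typing import List


def call_llm(prompt: str) -> str:
    """Placeholder for an LLM call (e.g., a small open-source model).

    Replace with a real inference backend; here it returns a stub so the
    sketch runs end to end.
    """
    return f"[model output for prompt of {len(prompt)} chars]"


@dataclass
class DimensionEvaluator:
    """One expert judge focused on a single error dimension."""
    dimension: str   # e.g. "accuracy", "fluency", "terminology" (assumed)
    criteria: str    # detailed rubric for this dimension
    examples: str    # few-shot error examples (assumed format)

    def evaluate(self, source: str, translation: str) -> str:
        prompt = (
            f"You are an expert {self.dimension} evaluator.\n"
            f"Criteria: {self.criteria}\nExamples: {self.examples}\n"
            f"Source: {source}\nTranslation: {translation}\n"
            f"List {self.dimension} errors and suggested fixes."
        )
        return call_llm(prompt)


def refine(source: str, translation: str, feedback: str) -> str:
    """Ask the model to revise the translation given some feedback."""
    prompt = (
        f"Source: {source}\nCurrent translation: {translation}\n"
        f"Feedback: {feedback}\nReturn an improved translation."
    )
    return call_llm(prompt)


def sequential_refine(source: str, translation: str,
                      evaluators: List[DimensionEvaluator]) -> str:
    """Apply each evaluator in turn, refining after every feedback step."""
    current = translation
    for ev in evaluators:
        feedback = ev.evaluate(source, current)
        current = refine(source, current, feedback)
    return current


def parallel_refine(source: str, translation: str,
                    evaluators: List[DimensionEvaluator]) -> List[str]:
    """Refine independently per dimension, yielding one candidate each."""
    candidates = []
    for ev in evaluators:
        feedback = ev.evaluate(source, translation)
        candidates.append(refine(source, translation, feedback))
    return candidates


def comprehensive_refine(source: str, translation: str,
                         evaluators: List[DimensionEvaluator]) -> str:
    """Collect all dimension-specific feedback, then refine once jointly."""
    merged = "\n".join(ev.evaluate(source, translation) for ev in evaluators)
    return refine(source, translation, merged)


if __name__ == "__main__":
    evaluators = [
        DimensionEvaluator("accuracy", "Flag mistranslations and omissions.", "..."),
        DimensionEvaluator("fluency", "Flag grammar and naturalness issues.", "..."),
    ]
    src, hyp = "Der Hund schläft.", "The dog sleep."
    print(sequential_refine(src, hyp, evaluators))
```

Under these assumptions, the sequential strategy threads one translation through every evaluator, the parallel strategy produces per-dimension candidates that would still need selection or merging, and the comprehensive strategy concatenates all feedback before a single refinement pass.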
Paper Type: Long
Research Area: Machine Translation
Research Area Keywords: Machine Translation, Agent
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models
Languages Studied: English, German, Chinese
Submission Number: 6973