Understanding Before Evaluation: A Reliable Framework for Assessing Non-Deterministic Machine Translation Systems

20 Sept 2025 (modified: 03 Dec 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: non-deterministic, machine translation, automatic evaluation
Abstract: Modern machine translation (MT) systems exhibit non-deterministic behavior, producing varying outputs across runs in both neural MT and LLM-based MT. This variability poses significant challenges for automatic evaluation methods (AEMs), leading to unreliable quality assessments. To address this limitation, we propose a two-stage **"Understanding Before Evaluation"** framework. In the understanding stage, we formalize and measure the degree of non-determinism from both lexical and semantic perspectives using a simple sample-based strategy. Comprehensive experiments on public datasets reveal high variance in lexical-based metrics but stable behavior in semantic-based metrics across MT systems. In the evaluation stage, we propose a reliable *ExpectoSample* method that explicitly incorporates non-deterministic characteristics to mitigate variance effects. Our two-stage framework delivers more reliable quality assessments for modern MT systems. Furthermore, our methods offer a potential way to assess MT metrics without human involvement and highlight the superiority of semantic-based metrics for evaluating modern non-deterministic MT systems.
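The abstract does not spell out the sampling procedure or the *ExpectoSample* algorithm, so the following is only a minimal Python sketch of the two stages under stated assumptions: `translate` stands in for one stochastic run of an MT system, `metric` for any hypothesis-scoring function (lexical like BLEU or semantic like an embedding similarity), and both estimators are illustrative, not the paper's actual method.

```python
import statistics
from itertools import combinations
from typing import Callable

# Hypothetical interfaces (not defined in the abstract):
# `translate(src)` samples one output from a non-deterministic MT system;
# `metric(hyp, ref)` scores a hypothesis against a reference text.
Translate = Callable[[str], str]
Metric = Callable[[str, str], float]

def self_agreement(src: str, translate: Translate,
                   metric: Metric, k: int = 8) -> float:
    """Understanding stage (sketch): sample k translations of `src` and
    report their mean pairwise metric score. Lower agreement under a given
    metric indicates a higher degree of non-determinism from that
    metric's (lexical or semantic) perspective."""
    samples = [translate(src) for _ in range(k)]
    return statistics.fmean(metric(a, b) for a, b in combinations(samples, 2))

def expected_score(src: str, ref: str, translate: Translate,
                   metric: Metric, k: int = 8) -> float:
    """Evaluation stage (sketch): average the metric over k sampled
    outputs instead of scoring a single run, so one lucky or unlucky
    sample does not dominate the quality assessment."""
    return statistics.fmean(metric(translate(src), ref) for _ in range(k))
```

The expectation-style estimator reflects the general idea of folding sampling variance into the score; how *ExpectoSample* actually weights or combines samples is specified in the paper, not here.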
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 24063