Penalizing Length: Uncovering Systematic Bias in Quality Estimation Metrics

Yilin Zhang; Wenda Xu; Zhongtao Liu; Tetsuji Nakagawa; Markus Freitag

19 Sept 2025 (modified: 11 Feb 2026). Submitted to ICLR 2026. License: CC BY 4.0

Keywords: Machine Translation (MT), Quality Estimation, LLM-as-a-Judge

TL;DR: This paper identifies and mitigates a systematic bias in Quality Estimation metrics that causes them to unfairly penalize longer translations, which in turn affects the training of multilingual LLMs.

Abstract: Quality Estimation (QE) metrics are vital in machine translation for reference-free evaluation and as a reward signal in tasks like reinforcement learning. However, the prevalence and impact of length bias in QE have been underexplored. Through a systematic study of top-performing regression-based and LLM-as-a-Judge QE metrics across 10 diverse language pairs, we reveal two critical length biases. First, QE metrics consistently over-predict errors as translation length increases, even for high-quality, error-free texts. Second, they prefer shorter translations when multiple candidates are available for the same source text. These inherent length biases risk unfairly penalizing longer, correct translations and can lead to sub-optimal decision-making in applications such as QE reranking and QE-guided reinforcement learning. To mitigate this, we propose two strategies: (a) applying length normalization during model training, and (b) incorporating reference texts during evaluation. Both approaches were found to effectively reduce the identified length biases.

Primary Area: foundation or frontier models, including LLMs

Submission Number: 20664
