Keywords: Machine Translation, Evaluation
Abstract: We present PEAR (Pairwise Evaluation for Automatic Relative scoring), a supervised quality estimation (QE) metric family that reframes reference-free MT evaluation as a graded pairwise comparison. Given a source segment and two candidate translations, PEAR predicts the direction and magnitude of their quality difference. PEAR learns from pairwise supervision constructed by differencing human segment-level judgments under an antisymmetry-consistent objective.
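As a rough illustration of how such pairwise supervision and an antisymmetry-consistent scorer could be set up, the Python sketch below differences human segment-level judgments into signed targets and wraps an arbitrary pairwise scorer so that swapping the two candidates flips the sign. The helper names and the subtraction trick are assumptions made for illustration, not necessarily PEAR's actual architecture or training objective.

```python
# Hedged sketch: one way to build signed pairwise targets and an
# antisymmetric pairwise scorer. Names and details are illustrative
# assumptions, not the exact PEAR recipe.
from dataclasses import dataclass
from itertools import combinations


@dataclass
class Segment:
    source: str
    translation: str
    human_score: float  # segment-level human judgment (e.g., DA- or MQM-derived)


def pairwise_targets(segments: list[Segment]) -> list[tuple[str, str, str, float]]:
    """Difference human judgments for candidate pairs that share a source.

    Returns tuples (source, candidate_a, candidate_b, signed_target),
    where the target is positive when candidate_a is judged better.
    """
    examples = []
    for a, b in combinations(segments, 2):
        if a.source != b.source:
            continue
        examples.append(
            (a.source, a.translation, b.translation, a.human_score - b.human_score)
        )
    return examples


def antisymmetric_score(g, src: str, cand_a: str, cand_b: str) -> float:
    """Make any pairwise scorer g antisymmetric by construction:
    f(src, a, b) = g(src, a, b) - g(src, b, a) = -f(src, b, a)."""
    return g(src, cand_a, cand_b) - g(src, cand_b, cand_a)
```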
On the WMT24 meta-evaluation benchmark, PEAR outperforms strictly matched single-candidate QE baselines trained with the same data and backbones, isolating the benefit of the pairwise formulation. Despite using substantially fewer parameters than recent large WMT submissions, PEAR surpasses far larger QE models and strong reference-based metrics. Inter-metric analyses further indicate that PEAR provides an evaluation signal that is less redundant with other top metrics. Finally, we show that PEAR serves as a strong utility function for Minimum Bayes Risk (MBR) decoding, and that an antisymmetry-based shortcut reduces pairwise scoring cost with negligible impact on performance.
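The sketch below shows how an antisymmetric pairwise metric might be plugged into MBR-style candidate selection: each unordered candidate pair is scored once and the reverse direction is recovered by negation, roughly halving the number of metric calls in this setup. `pear_score` and the sum-of-advantages selection rule are illustrative assumptions rather than the paper's exact procedure.

```python
# Hedged sketch: MBR-style selection with a pairwise utility, using
# antisymmetry (u(s, a, b) = -u(s, b, a)) to score each unordered pair
# only once. `pear_score` is a stand-in for the trained metric.
from typing import Callable


def mbr_select(source: str,
               candidates: list[str],
               pear_score: Callable[[str, str, str], float]) -> str:
    """Return the candidate with the highest total signed advantage
    over all other candidates."""
    n = len(candidates)
    totals = [0.0] * n
    for i in range(n):
        for j in range(i + 1, n):
            # One metric call per unordered pair; the reverse direction
            # is obtained by negation instead of a second call.
            delta = pear_score(source, candidates[i], candidates[j])
            totals[i] += delta
            totals[j] -= delta
    return candidates[max(range(n), key=totals.__getitem__)]
```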
Paper Type: Long
Research Area: Machine Translation
Research Area Keywords: Machine Translation, Multilingualism and Cross-Lingual NLP, Resources and Evaluation
Contribution Types: NLP engineering experiment, Approaches low compute settings-efficiency, Publicly available software and/or pre-trained models
Languages Studied: English, German, Spanish, Japanese, Chinese
Submission Number: 10553