Keywords: Machine Translation, Evaluation
Abstract: We present PEAR (Pairwise Evaluation for Automatic Relative scoring), a supervised quality estimation (QE) metric family that reframes reference-free MT evaluation as a graded pairwise comparison. Given a source segment and two candidate translations, PEAR predicts the direction and magnitude of their quality difference. PEAR learns from pairwise supervision constructed by differencing human segment-level judgments under an antisymmetry-consistent objective.
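As a rough illustration of how such pairwise supervision and an antisymmetry-consistent scorer could be set up, the Python sketch below differences human segment-level judgments into signed targets and wraps an arbitrary pairwise scorer so that swapping the two candidates flips the sign. The helper names and the subtraction trick are assumptions made for illustration, not necessarily PEAR's actual architecture or training objective.

```python
# Hedged sketch: one way to build signed pairwise targets and an
# antisymmetric pairwise scorer. Names and details are illustrative
# assumptions, not the exact PEAR recipe.
from dataclasses import dataclass
from itertools import combinations


@dataclass
class Segment:
    source: str
    translation: str
    human_score: float  # segment-level human judgment (e.g., DA- or MQM-derived)


def pairwise_targets(segments: list[Segment]) -> list[tuple[str, str, str, float]]:
    """Difference human judgments for candidate pairs that share a source.

    Returns tuples (source, candidate_a, candidate_b, signed_target),
    where the target is positive when candidate_a is judged better.
    """
    examples = []
    for a, b in combinations(segments, 2):
        if a.source != b.source:
            continue
        examples.append(
            (a.source, a.translation, b.translation, a.human_score - b.human_score)
        )
    return examples


def antisymmetric_score(g, src: str, cand_a: str, cand_b: str) -> float:
    """Make any pairwise scorer g antisymmetric by construction:
    f(src, a, b) = g(src, a, b) - g(src, b, a) = -f(src, b, a)."""
    return g(src, cand_a, cand_b) - g(src, cand_b, cand_a)
```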
On the WMT24 meta-evaluation benchmark, PEAR outperforms strictly matched single-candidate QE baselines trained with the same data and backbones, isolating the benefit of the pairwise formulation. Despite using substantially fewer parameters than recent large WMT submissions, PEAR surpasses far larger QE models and strong reference-based metrics. Inter-metric analyses further indicate that PEAR provides an evaluation signal that is less redundant with other top metrics. Finally, we show that PEAR serves as a strong utility function for Minimum Bayes Risk (MBR) decoding, and that an antisymmetry-based shortcut reduces pairwise scoring cost with negligible impact on performance.
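The sketch below shows how an antisymmetric pairwise metric might be plugged into MBR-style candidate selection: each unordered candidate pair is scored once and the reverse direction is recovered by negation, roughly halving the number of metric calls in this setup. `pear_score` and the sum-of-advantages selection rule are illustrative assumptions rather than the paper's exact procedure.

```python
# Hedged sketch: MBR-style selection with a pairwise utility, using
# antisymmetry (u(s, a, b) = -u(s, b, a)) to score each unordered pair
# only once. `pear_score` is a stand-in for the trained metric.
from typing import Callable


def mbr_select(source: str,
               candidates: list[str],
               pear_score: Callable[[str, str, str], float]) -> str:
    """Return the candidate with the highest total signed advantage
    over all other candidates."""
    n = len(candidates)
    totals = [0.0] * n
    for i in range(n):
        for j in range(i + 1, n):
            # One metric call per unordered pair; the reverse direction
            # is obtained by negation instead of a second call.
            delta = pear_score(source, candidates[i], candidates[j])
            totals[i] += delta
            totals[j] -= delta
    return candidates[max(range(n), key=totals.__getitem__)]
```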
Paper Type: Long
Research Area: Machine Translation
Research Area Keywords: Machine Translation, Multilingualism and Cross-Lingual NLP, Resources and Evaluation
Contribution Types: NLP engineering experiment, Approaches low compute settings-efficiency, Publicly available software and/or pre-trained models
Languages Studied: English, German, Spanish, Japanese, Chinese
Submission Number: 10553