Explanation Quality Assessment as Ranking with Listwise Rewards

ACL ARR 2026 January Submission 10786 Authors (anonymous)

06 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · License: CC BY 4.0
Keywords: PPO, ListNet, Preference learning
Abstract: We reformulate explanation quality assessment as a ranking problem rather than a generation problem. Instead of optimizing models to produce a single “best” explanation token-by-token, we train reward models to discriminate among multiple candidate explanations and learn their relative quality. Concretely, we construct per-instance candidate sets with graded quality levels, and train listwise and pairwise ranking models (ListNet, LambdaRank, RankNet) to preserve ordinal structure and avoid the score compression typical of pointwise regression or binary preference objectives. We then use the learned ranking score as a dense reward inside PPO, so the policy receives meaningful advantage signals that reflect relative explanation quality. Across multiple explanation datasets and transfer settings, ranking-based rewards yield better discrimination, faster convergence, and improved explanation quality compared to regression-style reward modeling and generate-then-rerank baselines. Code and models are available at an anonymous repository: https://anonymous.4open.science/r/PPO_Learning_to_rank-68F2/
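As a rough illustration of the listwise objective the abstract describes, below is a minimal sketch of the ListNet (top-1) loss in PyTorch. The function name `listnet_loss`, the tensor shapes, and the graded labels are illustrative assumptions, not details taken from the paper; in the paper's setup, the trained scorer's output would then serve as the dense reward inside PPO.

```python
import torch
import torch.nn.functional as F

def listnet_loss(pred_scores: torch.Tensor, true_scores: torch.Tensor) -> torch.Tensor:
    """ListNet top-1 loss: cross-entropy between the softmax (top-1
    permutation-probability) distributions induced by the predicted
    scores and the graded ground-truth scores."""
    true_dist = F.softmax(true_scores, dim=-1)     # target top-1 distribution
    log_pred = F.log_softmax(pred_scores, dim=-1)  # predicted log-distribution
    return -(true_dist * log_pred).sum(dim=-1).mean()

# Illustrative usage: a reward model scores 4 candidate explanations per
# instance; graded labels (3 = best .. 0 = worst) encode ordinal quality.
pred = torch.randn(2, 4, requires_grad=True)                   # model scores
labels = torch.tensor([[3., 2., 1., 0.], [2., 2., 1., 0.]])    # graded quality
loss = listnet_loss(pred, labels)
loss.backward()
```

Because the loss compares full score distributions over each candidate list rather than isolated point estimates, it pushes the scorer to preserve the ordinal gaps between quality levels instead of collapsing them.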
Paper Type: Short
Research Area: Machine Learning for NLP
Research Area Keywords: NLP explanations, Learning to Rank, Explanation Quality Assessment
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 10786