Taming the Judge: Deconflicting AI Feedback for Stable Reinforcement Learning

ICLR 2026 Conference Submission 3954 Authors

11 Sept 2025 (modified: 26 Nov 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: LLM, Reinforcement Learning from AI Feedback, Language Model Alignment
Abstract: Aligning language models with LLM judge feedback offers a scalable alternative to human annotation, yet it is plagued by judgment inconsistencies that destabilize reinforcement learning. While prior work has focused on judge accuracy, the critical issue of logical coherence—particularly preference cycles ($A \succ B \succ C \succ A$)—has been largely overlooked. To close this gap, we introduce an end-to-end framework that systematically detects and resolves these inconsistencies within the reinforcement learning training loop. The framework rests on two core contributions: the \textbf{Conflict Detection Rate (CDR)}, a novel metric that quantifies judgment conflicts, and \textbf{Deconflicted Graph Rewards (DGR)}, a signal-purification framework that eliminates preference cycles before policy optimization. DGR constructs preference graphs from raw judgments, transforms them into conflict-free Directed Acyclic Graphs (DAGs), and derives a logically coherent reward signal compatible with any policy optimizer. Experiments confirm that our framework significantly improves training stability and model performance over strong baselines, establishing logical consistency as a crucial and now-addressable dimension of AI feedback. The code for our method is available at \url{https://anonymous.4open.science/status/DGR-E5DA}.
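The abstract describes a pipeline (preference graph → conflict removal → DAG → reward), but not its exact rules. The sketch below is a minimal illustration of that pipeline under stated assumptions: it uses `networkx`, treats the conflict rate as the fraction of judged pairs involved in cycles (a proxy, not the paper's CDR definition), breaks cycles with a greedy feedback-arc heuristic, and scores responses by topological rank. The function names `conflict_detection_rate` and `deconflicted_rewards` are hypothetical and not taken from the paper's code.

```python
import networkx as nx  # assumed dependency for graph handling


def conflict_detection_rate(judgments):
    """Illustrative proxy for CDR: fraction of judged pairs relative to the
    number of simple cycles in the preference graph.

    `judgments` is a list of (winner, loser) pairs from the LLM judge,
    where (A, B) means the judge preferred A over B.
    """
    graph = nx.DiGraph(judgments)
    cycles = list(nx.simple_cycles(graph))
    return len(cycles) / max(len(judgments), 1)


def deconflicted_rewards(judgments):
    """DGR-style sketch: build the preference graph, greedily drop one edge
    per remaining cycle until the graph is a DAG (the paper's deconfliction
    rule may differ), then assign each response a reward from its
    topological position (earlier in the order = more preferred)."""
    graph = nx.DiGraph(judgments)
    while not nx.is_directed_acyclic_graph(graph):
        cycle_edges = nx.find_cycle(graph)
        graph.remove_edge(*cycle_edges[0][:2])  # drop one conflicting edge
    order = list(nx.topological_sort(graph))
    n = len(order)
    return {node: (n - 1 - rank) / max(n - 1, 1) for rank, node in enumerate(order)}


# Example: a cyclic judgment set A>B, B>C, C>A plus a consistent edge A>D.
judgments = [("A", "B"), ("B", "C"), ("C", "A"), ("A", "D")]
print(conflict_detection_rate(judgments))  # nonzero: the judge is cyclic
print(deconflicted_rewards(judgments))     # cycle-free rewards in [0, 1]
```

In an RLAIF loop, the dictionary returned by `deconflicted_rewards` would stand in for the raw pairwise judge scores fed to the policy optimizer, which is the sense in which the paper's signal is "compatible with any policy optimizer".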
Primary Area: foundation or frontier models, including LLMs
Submission Number: 3954