Keywords: pluralistic alignment, preference data, RLHF, reward model evaluation, LLM-as-judge
Abstract: UltraFeedback (UF) preference labels are widely used to align open chatbots and are trusted as a signal of answer quality, yet an independent, safety-trained reward model disagrees with them more often than it agrees. Each binarized "chosen/rejected" label comes from a single opaque holistic GPT-4 score rather than from UF's four quality axes. We test whether that label is a universal quality signal or one particular value resolution by re-scoring the same pairs under independent reward rubrics and measuring agreement with UF's ordering. Of all 19 ArmoRM heads, only the BeaverTails-safety head (trained on a UF-disjoint corpus) agrees with UF less often than chance: 43.5% agreement on n=5,000 pairs, i.e. a 56.5% inversion rate against a 50% null. Every other head, along with an independent reward model (Skywork-Reward-V2-8B, no UF in training), reproduces UF's ordering 55–77% of the time. The inversion survives length and refusal controls, holds across benign and safety-relevant prompts, and replicates on an independently built dataset (Skywork-Reward-80K), so it is not UF-specific. It is not universal either: on a third dataset (HelpSteer2) the apparent inversion does not survive controls. We read this narrowly as one legitimate rubric, the safety objective, reordering UF's preferences: rubric disagreement, not a proven helpful-vs-safe tradeoff and not multi-position pluralism.
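The agreement measurement described in the abstract can be illustrated with a minimal sketch. It is not the authors' evaluation code: the function names `agreement_rate` and `z_vs_chance`, the toy scores, the strict-inequality treatment of ties, and the normal approximation to the binomial are all assumptions made here for illustration. Only the 43.5% figure and n=5,000 come from the abstract.

```python
import math


def agreement_rate(chosen_scores, rejected_scores):
    """Fraction of pairs where the rubric scores UF's 'chosen' response
    strictly above UF's 'rejected' response.

    Assumption: ties count as disagreement; the paper may handle them differently.
    """
    if len(chosen_scores) != len(rejected_scores) or not chosen_scores:
        raise ValueError("need two non-empty score lists of equal length")
    agree = sum(c > r for c, r in zip(chosen_scores, rejected_scores))
    return agree / len(chosen_scores)


def z_vs_chance(rate, n):
    """Normal-approximation z-score of an agreement rate against the 50% null.

    Assumption: pairs are treated as independent Bernoulli trials.
    """
    return (rate - 0.5) / math.sqrt(0.25 / n)


# Hypothetical toy scores from some reward rubric (not real data).
chosen = [0.2, 0.9, 0.1, 0.4]
rejected = [0.5, 0.3, 0.6, 0.7]
toy_rate = agreement_rate(chosen, rejected)  # 1 of 4 pairs agree

# Figures reported in the abstract: 43.5% agreement on n=5,000 pairs.
z = z_vs_chance(0.435, 5000)
print(f"toy agreement = {toy_rate:.2f}, inversion = {1 - toy_rate:.2f}")
print(f"z vs 50% null at 43.5% on n=5000: {z:.2f}")
```

Under these assumptions the inversion rate is simply one minus the agreement rate, and a strongly negative z-score indicates agreement below chance rather than sampling noise around 50%.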
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 133