Fault-Tolerant Preference Alignment via Multi-Agent Verification
Keywords: Preference Alignment, Fault-Tolerant Learning, Large Language Models, Multi-Agent Verification, Preference Poisoning
TL;DR: We propose multi-agent verification to make preference-based alignment more robust by filtering corrupted supervision prior to RLHF/DPO training.
Abstract: Preference-based optimization methods such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) are widely used to align large language models (LLMs). However, these methods implicitly assume that preference supervision is reliable, despite growing evidence that preference data may be noisy, biased, or adversarially manipulated. We introduce MPV, a multi-agent verification framework that filters preference data through specialized verifier agents for factuality, safety, ethics, and trustworthiness prior to optimization. A k-of-n consensus rule admits only supervision approved by multiple heterogeneous agents, yielding a Verified Preference Dataset (VPD). We provide a theoretical analysis showing that consensus-based verification can exponentially suppress corrupted supervision under mild assumptions, while introducing an explicit safety–retention trade-off. Empirically, we evaluate Verified DPO on Qwen2-7B across safety, summarization, factual, and biomedical QA benchmarks. Our results demonstrate consistent reductions in over-refusal behavior and improved robustness under noisy supervision, with dataset-dependent trade-offs in factual coverage and calibration under stricter consensus thresholds. Together, these findings position multi-agent verification as a principled, data-centric complement to existing preference optimization methods, offering an early but promising pathway toward more reliable and trustworthy LLM alignment.
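The k-of-n consensus rule described in the abstract admits a preference pair only when at least k of the n heterogeneous verifier agents approve it. A minimal sketch of that filtering step, assuming a simple boolean-vote interface; the stand-in heuristic verifiers below are illustrative only, not the paper's specialized LLM-based agents for factuality, safety, ethics, and trustworthiness:

```python
# Sketch of k-of-n consensus filtering to build a Verified Preference
# Dataset (VPD). All names and the toy verifiers are assumptions for
# illustration, not the authors' implementation.
from typing import Callable, List, Tuple

Preference = Tuple[str, str, str]          # (prompt, chosen, rejected)
Verifier = Callable[[Preference], bool]    # True = approve the pair

def k_of_n_filter(data: List[Preference],
                  verifiers: List[Verifier],
                  k: int) -> List[Preference]:
    """Admit a preference pair only if at least k of n verifiers approve it."""
    return [p for p in data if sum(v(p) for v in verifiers) >= k]

# Trivial heuristic stand-ins for the paper's verifier agents.
safety = lambda p: "attack" not in p[1].lower()
length = lambda p: len(p[1]) >= len(p[2])   # crude proxy for substantive answers
nontrivial = lambda p: p[1] != p[2]

data = [
    ("How do I stay safe online?",
     "Use strong, unique passwords and enable 2FA.", "Idk."),
    ("Explain phishing.",
     "Here's an attack script.",
     "I can't help with that, but phishing is fraudulent impersonation."),
]
vpd = k_of_n_filter(data, [safety, length, nontrivial], k=2)  # 2-of-3 consensus
```

Under this 2-of-3 rule the second pair is rejected (it fails both the safety and length votes), illustrating how raising k trades retention for stricter filtering of corrupted supervision.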
Submission Number: 26