Keywords: Inverse Constitutional AI, Human Preference Alignment, Adversarial Debate, LLM-as-a-Judge
Abstract: Preference-based alignment often struggles to capture the reasoning that underlies human judgments. Many evaluations rely on multiple interacting criteria, yet pairwise labels reveal only the final choice rather than the considerations that shape preferences. Inverse Constitutional AI (ICAI) improves interpretability in decision making by summarizing preferences into natural-language principles, but its single-pass explanations miss much of the nuance involved in complex decisions. We introduce Democratic ICAI, a novel approach that gathers multiple competing rationales through structured persona debate, offering a broader and more expressive account of the factors influencing each comparison. From these richer signals, we derive clearer and more comprehensive steering principles and use them to guide decision modeling through both LLM-based and decision-tree judges. Experiments on creative-writing preferences show that Democratic ICAI uncovers deeper and more nuanced preference structure than ICAI alone, providing a more coherent and reliable foundation for aligned model behavior.
Paper Type: New Full Paper
Submission Number: 73