Decoding Human Preferences in Alignment: An Improved Approach to Inverse Constitutional AI

TMLR Paper 4433 Authors

09 Mar 2025 (modified: 21 Jul 2025) · Under review for TMLR · CC BY 4.0
Abstract: Traditional methods for aligning Large Language Models (LLMs), such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO), rely on implicit principles, limiting interpretability. Constitutional AI (CAI) offers an explicit, rule-based framework for guiding LLM alignment. Building on this, we refine the Inverse Constitutional AI (ICAI) algorithm, which extracts constitutions from preference datasets. By improving principle generation, clustering, and embedding processes, our approach enhances the accuracy and generalizability of extracted principles across synthetic and real-world datasets. Our results highlight the potential of these principles to foster more transparent and adaptable alignment methods, offering a promising direction for future advancements beyond traditional fine-tuning.
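The abstract describes a pipeline that generates candidate principles from preference data, embeds them, and clusters them into a compact constitution. The following is a minimal sketch of such an embed-and-cluster step, assuming candidate principles have already been generated (e.g., by prompting an LLM to explain why each preferred response was chosen); the embedding model, cluster count, and example principles are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch of an ICAI-style principle-consolidation step (illustrative only).
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
import numpy as np

# Hypothetical candidate principles, one or more per preference pair.
candidate_principles = [
    "Prefer responses that answer the question directly.",
    "Prefer responses that avoid unsupported claims.",
    "Prefer concise responses over verbose ones.",
    "Prefer responses that politely refuse harmful requests.",
]

# 1. Embed each candidate principle.
embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice
embeddings = embedder.encode(candidate_principles, normalize_embeddings=True)

# 2. Cluster the embeddings to merge near-duplicate principles.
n_clusters = 2  # in practice chosen by a heuristic or validation
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(embeddings)

# 3. Take the principle nearest each cluster centroid as its representative;
#    the representatives form the extracted "constitution".
constitution = []
for c in range(n_clusters):
    members = np.where(kmeans.labels_ == c)[0]
    dists = np.linalg.norm(embeddings[members] - kmeans.cluster_centers_[c], axis=1)
    constitution.append(candidate_principles[members[np.argmin(dists)]])

print(constitution)
```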
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: Inclusion of Sample Constitutions in Appendix B.
Assigned Action Editor: ~Yaodong_Yang1
Submission Number: 4433