C$^3$AI: Crafting and Evaluating Constitutions for Constitutional AI

Published: 29 Jan 2025, Last Modified: 29 Jan 2025 · WWW 2025 Poster · CC BY 4.0
Track: Responsible Web
Keywords: Constitutional AI, Human-AI Alignment, Responsible AI
Abstract: As large language models (LLMs) become more integrated into daily life, ensuring that they align with human values is crucial for both safety and transparency. Constitutional AI (CAI) offers a novel approach to self-aligning LLMs by using sets of principles, referred to as constitutions. While this method is elegant in its ability to self-supervise without costly human annotations, uncertainty remains about how to create effective constitutions and how to evaluate models based on them. Specifically, it is unclear to what extent a CAI model adheres to specific principles within its constitution and how differences between constitutions affect the model's overall behavior. To address this, we propose the C$^3$AI framework, which uses a pairwise preference evaluator to craft more effective constitutions. Drawing on insights from both AI and psychology, we evaluate a diverse set of principles using a network psychometric approach, constructing a constitutional principle graph to identify the most informative principles. Our findings reveal that principle-human agreement varies across principles and conversational categories (harmless, helpful, and general conversations). For instance, principles that emphasize respecting human rights, unsurprisingly, show higher human agreement on harmlessness. We then apply our graph-based principle selection method in a safety alignment use case and compare it to previous CAI approaches that lack principle selection. We find that fine-tuned CAI models tend to perform well on negatively framed principles (e.g., minimizing aggression) but worse on positively framed principles (e.g., those focused on benefiting humanity). Compared to prior work, our principle-selection-based fine-tuned model performs better on safety measures while maintaining competitive performance on general capabilities.
Overall, C$^3$AI provides a systematic and transparent approach to developing constitutional AI models, laying a foundation for more reliable and ethical LLM alignment.
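To make the graph-based principle selection idea in the abstract concrete, here is a minimal, hypothetical sketch (not the authors' code): principles are nodes, edge weights are absolute correlations between the principles' per-example agreement scores, and the least-redundant principles (lowest total edge weight) are retained as the most informative. The function names, scoring scheme, and selection criterion are illustrative assumptions, not the paper's actual network psychometric method.

```python
# Hypothetical sketch of graph-based principle selection (illustrative only).
# Nodes: constitutional principles. Edge weights: absolute Pearson correlation
# between per-example preference-agreement scores. Principles with low total
# edge weight are least redundant with the rest of the graph and are kept.

from itertools import combinations
from statistics import mean

def pearson(x, y):
    """Plain Pearson correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

def select_principles(scores, k):
    """scores: {principle: [agreement per example]}; keep the k least-redundant."""
    strength = {p: 0.0 for p in scores}
    for p, q in combinations(scores, 2):
        w = abs(pearson(scores[p], scores[q]))
        strength[p] += w
        strength[q] += w
    # Lower total edge weight = less correlated with the other principles.
    return sorted(scores, key=strength.get)[:k]

# Toy agreement data (hypothetical): 1 = evaluator preference matched humans.
scores = {
    "respect_rights": [1, 1, 0, 1, 1],
    "avoid_harm":     [1, 1, 0, 1, 0],
    "be_helpful":     [0, 1, 1, 0, 1],
}
print(select_principles(scores, 2))
```

In a real pipeline the agreement vectors would come from the pairwise preference evaluator described in the abstract, and a proper network psychometric model would replace the simple correlation graph used here.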
Submission Number: 2137
