TL;DR: We present Situated-PRInciples (SPRI), a framework that automatically generates constitutional principles tailored to each input instance and uses them to align responses.
Abstract: Aligning Large Language Models to integrate and reflect human values, especially for tasks that demand intricate human oversight, is arduous: depending on human expertise for context-specific guidance is resource-intensive and time-consuming. Prior work has utilized predefined sets of rules or principles to steer the behavior of models (Bai et al., 2022; Sun et al., 2023). However, these principles tend to be generic, making them difficult to adapt to each individual input query or context. In this work, we present Situated-PRInciples (SPRI), a framework requiring minimal or no human effort that automatically generates guiding principles in real time for each input query and uses them to align each response. We evaluate SPRI on three tasks and show that 1) SPRI derives principles in a complex domain-specific task that perform on par with expert-crafted ones; 2) SPRI-generated principles yield instance-specific rubrics that outperform prior LLM-as-a-judge frameworks; 3) using SPRI to generate synthetic SFT data leads to substantial improvements in truthfulness.
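The per-instance loop described above (situate a principle to the query, condition the response on it, then refine) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: `call_model` is a stand-in stub for an LLM API, and the prompt wording is hypothetical; the actual prompts and pipeline are in the linked repository.

```python
def call_model(prompt: str) -> str:
    """Stand-in for an LLM API call; replace with a real client."""
    return f"[model output for: {prompt[:40]}]"

def spri_align(query: str, refine_rounds: int = 1) -> dict:
    """Sketch of a SPRI-style alignment loop for a single input query."""
    # 1) Generate a guiding principle situated to this specific query.
    principle = call_model(
        f"Write a guiding principle tailored to answering:\n{query}"
    )
    # 2) Draft a response conditioned on that principle.
    response = call_model(f"Principle: {principle}\nAnswer the query: {query}")
    # 3) Critique and refine the response against the principle.
    for _ in range(refine_rounds):
        critique = call_model(
            f"Critique this response against the principle:\n"
            f"{principle}\n{response}"
        )
        response = call_model(
            f"Revise the response per the critique:\n{critique}\n{response}"
        )
    return {"principle": principle, "response": response}
```

The same situated principle could then serve as an instance-specific rubric for LLM-as-a-judge evaluation, or the final responses could be collected as synthetic SFT data, matching the three evaluation settings in the abstract.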
Lay Summary: Large language models (like ChatGPT) are powerful but often need guidance to behave in ways that align with human values, especially for sensitive or complex tasks. Traditionally, this guidance comes from predefined rules or expert feedback — a process that’s slow, costly, and hard to personalize.
Our work introduces a new approach called SPRI (Situated-PRInciples), which can automatically create custom guiding principles for each situation or question, without relying on humans to write them. Imagine a virtual assistant that doesn’t just follow a fixed set of rules, but figures out the best rules for the moment — all on its own.
We tested SPRI on several tasks and found that it performs as well as expert-written rules. It also improves how language models judge and respond to complex queries, leading to more truthful and context-sensitive outputs.
By making AI more adaptable and principled without human labor, SPRI takes a step toward scalable, value-aligned systems that can reason responsibly in diverse real-world scenarios.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/honglizhan/SPRI-public
Primary Area: Social Aspects->Alignment
Keywords: Large Language Models, Alignment, Scalable Context-Situated Oversight
Submission Number: 8545