Understanding and Steering Large Generative Models Using Constitutions for Atomic Concept Edits
TL;DR: We interpret and steer model behavior by applying structured prompt mutations, uncovering useful insights
Abstract: We introduce an interpretability framework that learns a verifiable constitution: a natural language summary of how specific changes to a prompt affect a model’s behavior, such as its alignment, correctness, or adherence to constraints. Our method leverages atomic concept edits (ACEs), which are targeted operations that add, remove, or replace an interpretable concept in the text. By systematically applying ACEs and observing the resulting effects on model behavior across various tasks, our framework learns a causal mapping from edits to predictable outcomes. This learned constitution provides deep, generalizable insights into the model. Empirically, we validate our approach across diverse tasks, demonstrating its ability to enforce word counts, perform active testing for mathematical reasoning, and adversarially steer text-to-image alignment.
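For intuition only, here is a minimal sketch of what an ACE-style pipeline could look like. The names `AtomicConceptEdit`, `learn_constitution`, `query_model`, and `score_behavior` are hypothetical placeholders introduced for illustration and do not come from the paper; the string-level edit operations are likewise an assumption about how concept edits might be realized.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class AtomicConceptEdit:
    """A targeted edit that adds, removes, or replaces one concept in a prompt (hypothetical)."""
    op: str                            # "add", "remove", or "replace"
    concept: str                       # the concept phrase to operate on
    replacement: Optional[str] = None  # only used when op == "replace"

    def apply(self, prompt: str) -> str:
        # Naive string-level realization of the edit; real systems may edit more carefully.
        if self.op == "add":
            return f"{prompt} {self.concept}"
        if self.op == "remove":
            return prompt.replace(self.concept, "").strip()
        if self.op == "replace":
            return prompt.replace(self.concept, self.replacement or "")
        raise ValueError(f"unknown op: {self.op}")


def learn_constitution(
    prompts: List[str],
    edits: List[AtomicConceptEdit],
    query_model: Callable[[str], str],       # assumed stand-in for the model under study
    score_behavior: Callable[[str], float],  # assumed task-specific behavior metric
) -> Dict[Tuple[str, str], float]:
    """Apply each ACE to each prompt, observe the change in the behavior score,
    and collect per-edit average effects as raw material for a constitution."""
    effects: Dict[Tuple[str, str], float] = {}
    for edit in edits:
        deltas = []
        for prompt in prompts:
            base = score_behavior(query_model(prompt))
            edited = score_behavior(query_model(edit.apply(prompt)))
            deltas.append(edited - base)
        effects[(edit.op, edit.concept)] = sum(deltas) / len(deltas)
    return effects
```

In this sketch, the returned edit-to-effect mapping would then be summarized in natural language to form the constitution described in the abstract.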
Submission Number: 1509