Understanding and Steering Large Generative Models Using Constitutions for Atomic Concept Edits
TL;DR: We interpret and steer model behavior by applying structured prompt mutations, uncovering useful insights
Abstract: We introduce an interpretability framework that learns a verifiable constitution: a natural language summary of how specific changes to a prompt affect a model’s behavior, such as its alignment, correctness, or adherence to constraints. Our method leverages atomic concept edits (ACEs), which are targeted operations that add, remove, or replace an interpretable concept in the text. By systematically applying ACEs and observing the resulting effects on model behavior across various tasks, our framework learns a causal mapping from edits to predictable outcomes. This learned constitution provides deep, generalizable insights into the model. Empirically, we validate our approach across diverse tasks, demonstrating its ability to enforce word counts, perform active testing for mathematical reasoning, and adversarially steer text-to-image alignment.
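For intuition only, here is a minimal sketch of what an ACE-style pipeline could look like. The names `AtomicConceptEdit`, `learn_constitution`, `query_model`, and `score_behavior` are hypothetical placeholders introduced for illustration and do not come from the paper; the string-level edit operations are likewise an assumption about how concept edits might be realized.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class AtomicConceptEdit:
    """A targeted edit that adds, removes, or replaces one concept in a prompt (hypothetical)."""
    op: str                            # "add", "remove", or "replace"
    concept: str                       # the concept phrase to operate on
    replacement: Optional[str] = None  # only used when op == "replace"

    def apply(self, prompt: str) -> str:
        # Naive string-level realization of the edit; real systems may edit more carefully.
        if self.op == "add":
            return f"{prompt} {self.concept}"
        if self.op == "remove":
            return prompt.replace(self.concept, "").strip()
        if self.op == "replace":
            return prompt.replace(self.concept, self.replacement or "")
        raise ValueError(f"unknown op: {self.op}")


def learn_constitution(
    prompts: List[str],
    edits: List[AtomicConceptEdit],
    query_model: Callable[[str], str],       # assumed stand-in for the model under study
    score_behavior: Callable[[str], float],  # assumed task-specific behavior metric
) -> Dict[Tuple[str, str], float]:
    """Apply each ACE to each prompt, observe the change in the behavior score,
    and collect per-edit average effects as raw material for a constitution."""
    effects: Dict[Tuple[str, str], float] = {}
    for edit in edits:
        deltas = []
        for prompt in prompts:
            base = score_behavior(query_model(prompt))
            edited = score_behavior(query_model(edit.apply(prompt)))
            deltas.append(edited - base)
        effects[(edit.op, edit.concept)] = sum(deltas) / len(deltas)
    return effects
```

In this sketch, the returned edit-to-effect mapping would then be summarized in natural language to form the constitution described in the abstract.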
Submission Number: 1509