Keywords: LLM Security, LLM Safety, Instruction Hierarchy, Steerability, Pluralistic Alignment
Abstract: Instruction Hierarchy (IH), the structured prioritization of system prompts over user prompts, has emerged as a key security mechanism for language models (LMs). Despite its importance for flexible steering and robust safety control, current LMs offer limited support for it and often fail to enforce system-level specifications when these conflict with user instructions. In this work, we introduce HieraSuite, a full-stack toolkit for building a steerable and secure system-user IH for LMs. HieraSuite comprises four components: (1) HieraInstruct, a large-scale and diverse collection of 221K system–user instruction pairs spanning four real-world application domains (system constraints, privacy and security, steerability, and task execution); (2) HieraConsReasoner, an effective and compact reasoner model, paired with training data, that elicits contextualized rubrics specifying what constitutes a valid response under hierarchical instructions; (3) HieraCRO, an iterative response optimization approach, grounded in constitutional rubrics, that enhances LM compliance with the instruction hierarchy; and (4) HieraBench, a unified benchmark that integrates ten tasks to assess the controllability, steerability, customizability, and security of the system-user instruction hierarchy. Together, these components form an end-to-end solution that yields consistent gains across model families and scales, including improvements of up to 66.9% on HieraBench tasks and gains of over 306.3% in overriding conflicting user instructions. Systematic testing of alignment recipes further identifies design choices that balance user instruction-following, system instruction-override, and general capabilities. This work provides a principled framework and practical toolkit for the system-user instruction hierarchy in LMs, laying the foundation for future studies on “instruction un-following” and advancing steerability and security in LM alignment.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 14282