Keywords: LLM Safety, Safety Control, Safety-Helpfulness Trade-off, Safety Alignment, Supervised Fine-Tuning (SFT), Controllable Generation
Abstract: Large Language Models (LLMs) face a fundamental safety-helpfulness trade-off due to static, one-size-fits-all safety policies that lack runtime controllability and interpretability. For instance, queries about morphine's medical use versus fentanyl synthesis instructions require fundamentally different responses, yet current methods often fail to distinguish such contexts, leading to either over-refusal or under-constraint.
We present \textbf{PACT} (Prompt-configured Action via Chain-of-Thought), a framework for dynamic safety control through explicit, risk-aware reasoning. PACT operates under a hierarchical policy architecture: a global safety policy establishes immutable boundaries for critical risks (e.g., child safety, violence) that cannot be modified by users, while user-defined policies enable per-category action customization for domain-specific needs. The framework decomposes safety decisions into structured reasoning paths that classify risks and map them to configurable actions (comply, guide, or reject), providing transparency while guaranteeing safety integrity.
Experiments demonstrate that PACT achieves COSA scores of \textbf{0.201} (vs. 0.011 for the base model) while maintaining \textbf{95.9\%} safety rate (vs. 70.8\% for the base model), effectively mitigating the safety-helpfulness trade-off. We release PACT models, training data, and evaluation protocols to facilitate reproducible research in controllable safety alignment.
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: chain-of-thought, fine-tuning, safety and alignment
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 10032