Keywords: LLM Safety, Safety Control, Safety-Helpfulness Trade-off, Safety Alignment, Supervised Fine-Tuning (SFT), Controllable Generation
Abstract: Large Language Models (LLMs) face a fundamental safety-helpfulness trade-off due to static, one-size-fits-all safety policies that lack runtime controllability and interpretability. For instance, queries about morphine's medical use versus fentanyl synthesis instructions require fundamentally different responses, yet current methods often fail to distinguish such contexts, leading to either over-refusal or under-constraint.
We present \textbf{PACT} (Prompt-configured Action via Chain-of-Thought), a framework for dynamic safety control through explicit, risk-aware reasoning. PACT operates under a hierarchical policy architecture: a global safety policy establishes immutable boundaries for critical risks (e.g., child safety, violence) that cannot be modified by users, while user-defined policies enable per-category action customization for domain-specific needs. The framework decomposes safety decisions into structured reasoning paths that classify risks and map them to configurable actions (comply, guide, or reject), providing transparency while guaranteeing safety integrity.
Experiments demonstrate that PACT achieves COSA scores of \textbf{0.201} (vs. 0.011 for the base model) while maintaining \textbf{95.9\%} safety rate (vs. 70.8\% for the base model), effectively mitigating the safety-helpfulness trade-off. We release PACT models, training data, and evaluation protocols to facilitate reproducible research in controllable safety alignment.
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: chain-of-thought, fine-tuning, safety and alignment
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 10032