Steer Model beyond Assistant: Controlling System Prompt Strength via Contrastive Decoding

ACL ARR 2026 January Submission 4339 Authors

05 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: LLM, Steerability, Contrastive Decoding, Alignment, Persona
Abstract: Large language models excel at complex instructions yet struggle to deviate from their helpful assistant persona, as post-training instills strong priors that resist conflicting instructions. We introduce system prompt strength, a training-free method that treats prompt adherence as a continuous control. By contrasting logits from target and default system prompts, we isolate and amplify the behavioral signal unique to the target persona by a scalar factor $\alpha$. Across five diverse benchmarks spanning constraint satisfaction, behavioral control, pluralistic alignment, capability modulation, and stylistic control, our method yields substantial improvements: up to +8.5 strict accuracy on IFEval, +45pp refusal rate on OffTopicEval, and +13\% steerability on Prompt-Steering. Our approach enables practitioners to modulate system prompt strength, providing dynamic control over model behavior without retraining.
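The abstract's core operation can be sketched in a few lines. The contrast-and-scale rule below follows the standard contrastive-decoding form implied by the description (combine the default-prompt logits with α times the difference between target-prompt and default-prompt logits); the function names and the toy logit values are illustrative, not taken from the paper.

```python
def steer_logits(logits_target, logits_default, alpha):
    """Sketch of system-prompt-strength steering via contrastive decoding.

    The difference (target - default) isolates the behavioral signal unique
    to the target system prompt; alpha scales it. alpha=0 recovers the
    default-prompt distribution, alpha=1 the target-prompt distribution,
    and alpha>1 amplifies adherence to the target persona.
    """
    return [d + alpha * (t - d) for t, d in zip(logits_target, logits_default)]

def argmax(xs):
    """Index of the largest logit (greedy next-token choice)."""
    return max(range(len(xs)), key=lambda i: xs[i])

# Hypothetical next-token logits for a 3-token vocabulary under two prompts.
logits_default = [2.0, 1.0, 0.5]   # default assistant persona prefers token 0
logits_target  = [1.0, 2.5, 0.5]   # target persona prefers token 1

for alpha in (0.0, 1.0, 2.0):
    steered = steer_logits(logits_target, logits_default, alpha)
    print(alpha, argmax(steered))   # token 1 wins once alpha is large enough
```

In practice the two logit vectors would come from two forward passes of the same model with different system prompts, with the steered logits fed to the usual sampler.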
Paper Type: Long
Research Area: Language Models
Research Area Keywords: Inference methods, Prompting, Safety and alignment, Human-centered evaluation
Languages Studied: English
Submission Number: 4339