Keywords: Activation Engineering, Safety, Alignment, Steering Vector
TL;DR: Adaptive activation steering computes context-specific safety steering on the fly from each input, within an SAE-compressed subspace.
Abstract: LLMs achieve strong capabilities, yet precisely steering their responses to meet ever-shifting safety requirements remains unresolved.
Current activation engineering methods embed a static premise — prompt categories elicit distinct activation patterns — and coerce each input into a hand-crafted semantic category.
This premise fails when adversarial prompt variations (e.g. jailbreaks) perturb the activations, yielding collateral suppression or undetected risks.
We contend that activation steering should be determined on the fly by the input itself within the semantic space, rather than being predetermined by rigid, hand-crafted categories.
In this paper, we propose Context-Specific Steering (COS-Steering), which maps the full safety-steering activation subspace and lets each input locate its own steering coordinates.
COS-Steering recovers this subspace by compressing a pool of steering signals into a compact set of basis vectors via a sparse autoencoder (SAE).
A lightweight module then reads the input activation and outputs weights over these basis vectors for context-specific steering.
To evaluate robustness against distribution shift, we test COS-Steering in a mixed-attack setting, which combines multiple attack methods, across datasets and models.
Compared to baselines, COS-Steering preserves strong refusal on harmful prompts while introducing negligible side effects on benign queries.
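As a rough illustration of the steering step described in the abstract, the sketch below adds an input-dependent combination of basis vectors to a single activation. It is not the authors' implementation. The basis here is a random stand-in for vectors that would come from the SAE, and the weighting module (one linear layer plus ReLU), the dimensions `d` and `k`, the `scale` parameter, and the function name `steer` are all assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 16, 4  # hypothetical hidden size and number of basis vectors

# Stand-in for the SAE-derived basis: k unit-norm directions in activation space.
basis = rng.normal(size=(k, d))
basis /= np.linalg.norm(basis, axis=1, keepdims=True)

# Hypothetical lightweight module: one linear layer followed by ReLU.
gate_w = rng.normal(scale=0.1, size=(k, d))
gate_b = np.zeros(k)


def steer(activation, basis, gate_w, gate_b, scale=1.0):
    """Return the steered activation and the per-basis weights used."""
    # Read the input activation and produce non-negative weights, one per basis vector.
    weights = np.maximum(gate_w @ activation + gate_b, 0.0)
    # Combine the basis vectors into a single context-specific steering vector.
    steering = weights @ basis
    return activation + scale * steering, weights


h = rng.normal(size=d)  # a single (synthetic) input activation
h_steered, w = steer(h, basis, gate_w, gate_b)
```

Because the weights are computed from the activation itself, two different inputs generally receive different steering vectors from the same shared basis, in contrast to a single fixed vector per hand-crafted category.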
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 24814