Keywords: Large Language Models, Safety, Steering, Activation Engineering, Controller Network, Refusal Detection, Inference-Time Control
Abstract: Controlling undesirable LLM behaviors typically requires costly fine-tuning, while existing inference-time steering methods lack fine-grained adaptivity. We introduce a lightweight, trainable controller network for adaptive inference-time control. The controller observes intermediate LLM activations to predict a global scaling factor and layer-specific weights, which dynamically modulate a pre-computed "refusal direction" vector. Trained on harmful and benign prompts, the controller learns to apply nuanced, layer-aware steering selectively. Experiments on Llama and Mistral models show our method significantly increases refusal rates on safety benchmarks like ToxicChat, outperforming existing approaches without altering the original model parameters.
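The abstract's mechanism (a controller reads intermediate activations, predicts a global scaling factor plus per-layer weights, and adds the scaled refusal direction back into each layer) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the controller parameters `W_g` and `W_l`, the mean-pooling, and the sigmoid/softmax heads are all hypothetical simplifications.

```python
import numpy as np

def controller_steer(activations, refusal_dir, W_g, W_l):
    """Adaptive inference-time steering sketch (hypothetical architecture).

    activations: (num_layers, hidden) intermediate residual-stream states
    refusal_dir: (hidden,) pre-computed refusal direction vector
    W_g:         (hidden,) weights for the global scaling head (assumed)
    W_l:         (hidden,) weights for the layer-weight head (assumed)
    """
    # Summarize the stream, then predict a global scale alpha in (0, 1)
    pooled = activations.mean(axis=0)
    alpha = 1.0 / (1.0 + np.exp(-pooled @ W_g))          # sigmoid gate
    # One logit per layer, softmaxed into layer-specific weights
    logits = activations @ W_l
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    # Inject the refusal direction, scaled globally and per layer
    return activations + alpha * weights[:, None] * refusal_dir[None, :]
```

On a benign prompt the trained controller would ideally drive `alpha` toward 0, leaving the model's activations (and hence its behavior) effectively unchanged; on a harmful prompt it would steer strongly at the layers where the refusal direction is most effective.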
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: Language Modeling, Computational Social Science and Cultural Analytics, NLP Applications
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-compute settings (efficiency), Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 2814