Beyond Linear Steering: Unified Multi-Attribute Control for Language Models

ACL ARR 2025 May Submission 5548 Authors

20 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Controlling multiple behavioral attributes in large language models (LLMs) at inference time is a challenging problem due to interference between attributes and the limitations of linear steering methods, which assume additive behavior in activation space and require per-attribute tuning. We introduce K-Steering, a unified and flexible approach that trains a single non-linear multi-label classifier on hidden activations and computes intervention directions via gradients at inference time. This avoids linearity assumptions, removes the need for storing and tuning separate attribute vectors, and allows dynamic composition of behaviors without retraining. To evaluate our method, we propose two new benchmarks, ToneBank and DebateMix, targeting compositional behavioral control. Empirical results across three model families, validated by both activation-based classifiers and LLM-based judges, demonstrate that K-Steering outperforms strong baselines in accurately steering multiple behaviors.
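
The abstract describes steering via gradients of a non-linear multi-label probe rather than fixed attribute vectors. The sketch below illustrates one way such a scheme could look; it is not the authors' implementation. The probe architecture, the summed-target-logit objective, and the names AttributeProbe, steer_hidden, and alpha are all assumptions made for illustration.

```python
# Hedged sketch: gradient-based multi-attribute steering with an MLP probe.
# All class/function names and the objective are illustrative assumptions.
import torch
import torch.nn as nn


class AttributeProbe(nn.Module):
    """Non-linear multi-label classifier over hidden activations."""

    def __init__(self, d_model: int, n_attributes: int, d_hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, n_attributes),  # one logit per attribute
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.net(h)


def steer_hidden(h: torch.Tensor, probe: AttributeProbe,
                 target_ids: list[int], alpha: float = 4.0) -> torch.Tensor:
    """Nudge activations `h` toward the attributes in `target_ids` by ascending
    the gradient of their summed logits (one possible steering objective)."""
    h = h.detach().requires_grad_(True)
    logits = probe(h)                            # (batch, n_attributes)
    objective = logits[:, target_ids].sum()      # raise target-attribute logits
    grad, = torch.autograd.grad(objective, h)
    direction = grad / (grad.norm(dim=-1, keepdim=True) + 1e-8)
    return (h + alpha * direction).detach()      # steered activations


# Usage idea: register a forward hook on a transformer layer, pass its
# residual-stream output through steer_hidden, and return the result so the
# steered activations flow to subsequent layers.
```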
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: Interpretability, Activation Steering, Non-linear Probes, Residual Stream Analysis, Gradient-Based Interventions, Model Edits, Multi-Attribute Steering, MLP-Guided Steering, K-Steering
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 5548