Steering-Conditioned LoRA via Sparse Autoencoders

08 May 2026 (modified: 09 May 2026) · ICML 2026 Workshop CoLoRAI Submission · CC BY 4.0
Keywords: Sparse Autoencoders, Low-Rank Adaptation, Parameter-Efficient Fine-Tuning, Mixture of Experts, Routing Collapse, Sparse Feature Geometry, Conditional Routing, Expert Specialization
TL;DR: We demonstrate that routing LoRA adapters through sparse autoencoder features prevents expert collapse and drives adaptation, with causal ablations showing that geometric feature structure—not generic conditioning—is the core mechanism.
Abstract: Parameter-efficient adaptation specializes large language models (LLMs) while preserving capabilities acquired during pretraining and instruction tuning. Low-Rank Adaptation (LoRA) and routed-adapter methods reduce the number of trainable parameters, but typically allocate updates from dense hidden states, whose entangled coordinates poorly separate overlapping domains. We propose Steering-Conditioned LoRA, which uses Sparse Autoencoder (SAE) features as the control space for low-rank adaptation. The method trains SAEs offline on base-model activations, freezes them, and uses their sparse features to route LoRA specialists and assign token-dependent effective rank. We evaluate Steering-Conditioned LoRA on Qwen3 models from 0.6B to 32B parameters across five domains, under SAE-conditioning interventions, and across sequential domain orders. Across scales, our method improves average instruction-following scores by $+0.016$ to $+0.052$ and average general-knowledge scores by $+0.017$ to $+0.023$ over base models, whereas gains on harder reasoning benchmarks, including MMLU-Pro and GSM8K, are marginal or negative for larger models. Causal controls show that the SAE signal drives these gains: replacing, randomizing, shuffling, or inverting it reduces target-domain scores by $33.6$--$65.6\%$, whereas removing dynamic rank while preserving SAE routing reduces scores by only $11.8\%$. In sequential adaptation, Steering-Conditioned LoRA reduces mean positive forgetting by $13.1\%$, and by $38.3\%$ when broad instruction-following data precedes narrower domains. These results show that the main bottleneck in routed parameter-efficient adaptation is not adapter capacity alone, but the representation space used to control it.
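To make the mechanism the abstract describes concrete — frozen SAE features routing LoRA specialists and gating token-dependent effective rank — here is a minimal PyTorch sketch. It assumes a ReLU SAE encoder, a softmax router over experts, and a soft per-rank gate; all names and shapes (`SAEConditionedLoRA`, `rank_gate`, `max_rank`) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SAEConditionedLoRA(nn.Module):
    """Sketch: route LoRA experts and gate effective rank via frozen SAE features."""

    def __init__(self, d_model: int, d_sae: int, n_experts: int = 4, max_rank: int = 16):
        super().__init__()
        # SAE encoder trained offline on base-model activations, then frozen.
        self.sae_enc = nn.Linear(d_model, d_sae)
        for p in self.sae_enc.parameters():
            p.requires_grad = False

        # One low-rank (A, B) pair per expert; B starts at zero so the
        # adapter initially contributes no update, as in standard LoRA.
        self.A = nn.Parameter(torch.randn(n_experts, d_model, max_rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(n_experts, max_rank, d_model))

        # Router and rank gate both read the sparse feature vector,
        # not the dense hidden state.
        self.router = nn.Linear(d_sae, n_experts)
        self.rank_gate = nn.Linear(d_sae, max_rank)

    def forward(self, h: torch.Tensor) -> torch.Tensor:  # h: (batch, seq, d_model)
        z = F.relu(self.sae_enc(h))              # sparse SAE features
        w = F.softmax(self.router(z), dim=-1)    # (.., n_experts) expert mixture
        g = torch.sigmoid(self.rank_gate(z))     # (.., max_rank) soft rank mask

        # Per-expert low-rank update with token-dependent effective rank:
        # u_e = ((h @ A_e) * g) @ B_e, then mix the experts by routing weight.
        u = torch.einsum('bsd,edr->bser', h, self.A)    # (b, s, e, r)
        u = u * g.unsqueeze(2)                          # gate rank components
        u = torch.einsum('bser,erd->bsed', u, self.B)   # (b, s, e, d)
        delta = (w.unsqueeze(-1) * u).sum(dim=2)        # weighted expert mixture
        return h + delta
```

The sigmoid gate keeps the effective rank differentiable end to end; a hard top-$k$ over the gate scores would give a discrete per-token rank instead, at the cost of a straight-through estimator during training.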
Submission Number: 113