Keywords: Sparse Autoencoders, Activation Steering, Multilingual Language Models, Interpretability
Abstract: Controlling the output language of multilingual language models via activation-level interventions has shown promising results, but often comes at the cost of generation instability. We investigate whether sparse autoencoder (SAE) features associated with specific languages can be incorporated into training-time objectives to achieve more stable control, and propose \emph{feature-aware supervised fine-tuning}, which integrates feature activation guidance with standard language modeling objectives and distributional regularization. Across several model families and languages, we find that feature-aware training yields limited but consistent improvements in language controllability, while reducing collapse and preserving fluency compared to inference-time steering. Our results reveal a clear trade-off between controllability and stability, and suggest that training-time feature alignment can help regularize this behavior in multilingual language models.
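The abstract describes a combined objective: a standard language modeling loss, a feature-activation guidance term on language-associated SAE features, and distributional regularization. As a hedged illustration only (the paper's exact loss terms, weights, and feature definitions are not given here), the combination might look like the following toy sketch, where `feature_aware_loss`, the hinge-style guidance term, and the KL-to-reference regularizer are all hypothetical stand-ins:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def feature_aware_loss(logits, targets, ref_logits, sae_feats,
                       lang_feat_idx, lam=0.1, beta=0.05):
    """Toy combined objective (illustrative, not the paper's exact loss):
    LM cross-entropy
    + feature-activation guidance (encourage the target-language SAE
      feature to stay active)
    + KL regularization toward a reference model's distribution."""
    probs = softmax(logits)
    n = len(targets)
    # standard next-token cross-entropy
    ce = -np.log(probs[np.arange(n), targets]).mean()
    # hinge-style penalty when the chosen language feature is inactive
    guide = np.maximum(0.0, 1.0 - sae_feats[:, lang_feat_idx]).mean()
    # KL(ref || model) keeps the fine-tuned distribution near the reference
    ref = softmax(ref_logits)
    kl = (ref * (np.log(ref) - np.log(probs))).sum(axis=-1).mean()
    return ce + lam * guide + beta * kl
```

The KL term is one plausible reading of "distributional regularization": it penalizes drift from the base model, which is consistent with the abstract's claim that the method reduces collapse relative to inference-time steering.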
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: Interpretability and Analysis of Models for NLP, Multilingualism and Cross-Lingual NLP, Language Modeling
Contribution Types: Model analysis & interpretability, Approaches to low-compute settings (efficiency)
Languages Studied: English, German, French, Spanish, Chinese, Japanese
Submission Number: 1204