Keywords: Sparse Autoencoders, Activation Steering, Multilingual Language Models, Interpretability
Abstract: Controlling the output language of multilingual language models via activation-level interventions has shown promising results, but often comes at the cost of generation instability. We investigate whether sparse autoencoder (SAE) features associated with specific languages can be incorporated into training-time objectives to achieve more stable control, and propose \emph{feature-aware supervised fine-tuning}, which integrates feature activation guidance with standard language modeling objectives and distributional regularization. Across several model families and languages, we find that feature-aware training yields limited but consistent improvements in language controllability, while reducing collapse and preserving fluency compared to inference-time steering. Our results reveal a clear trade-off between controllability and stability, and suggest that training-time feature alignment can help regularize this behavior in multilingual language models.
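The abstract describes a combined objective: a standard language modeling loss, a feature-activation guidance term on language-associated SAE features, and distributional regularization. As a hedged illustration only (the paper's exact loss terms, weights, and feature definitions are not given here), the combination might look like the following toy sketch, where `feature_aware_loss`, the hinge-style guidance term, and the KL-to-reference regularizer are all hypothetical stand-ins:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def feature_aware_loss(logits, targets, ref_logits, sae_feats,
                       lang_feat_idx, lam=0.1, beta=0.05):
    """Toy combined objective (illustrative, not the paper's exact loss):
    LM cross-entropy
    + feature-activation guidance (encourage the target-language SAE
      feature to stay active)
    + KL regularization toward a reference model's distribution."""
    probs = softmax(logits)
    n = len(targets)
    # standard next-token cross-entropy
    ce = -np.log(probs[np.arange(n), targets]).mean()
    # hinge-style penalty when the chosen language feature is inactive
    guide = np.maximum(0.0, 1.0 - sae_feats[:, lang_feat_idx]).mean()
    # KL(ref || model) keeps the fine-tuned distribution near the reference
    ref = softmax(ref_logits)
    kl = (ref * (np.log(ref) - np.log(probs))).sum(axis=-1).mean()
    return ce + lam * guide + beta * kl
```

The KL term is one plausible reading of "distributional regularization": it penalizes drift from the base model, which is consistent with the abstract's claim that the method reduces collapse relative to inference-time steering.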
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: Interpretability and Analysis of Models for NLP, Multilingualism and Cross-Lingual NLP, Language Modeling
Contribution Types: Model analysis & interpretability, Approaches to low-compute settings (efficiency)
Languages Studied: English, German, French, Spanish, Chinese, Japanese
Submission Number: 1204