Keywords: Sparse Autoencoders, Multilingual Language Models, Language Steering, Representation-Level Interpretability, Mechanistic Interpretability, Layerwise Representation Analysis
Abstract: Sparse autoencoders (SAEs) enable feature-level mechanistic interpretability and activation steering in large language models (LLMs), but SAE-based language control remains unreliable in multilingual settings: most SAEs are trained on English-only data, and steering layers are chosen heuristically. We address these limitations by advancing a principled, mechanistic account of multilingual language steering with SAEs.
First, we show that training SAEs on multilingual data consistently strengthens cross-lingual representations and yields more reliable, quality-preserving language control across layers and model families. Second, we introduce an \emph{a priori} steering layer-selection rule based on the intersection of multilingual alignment and language separability, which predicts effective intervention depths without exhaustive layerwise search.
We evaluate our approach on LLaMA-3.1-8B and Gemma-2-9B across machine translation and cross-lingual summarization (CrossSumm), using SpBLEU, ROUGE-L, COMET, and LaSE. Our results show that multilingual SAEs combined with intersection-selected layers stabilize the trade-off between language identification accuracy and generation quality, providing a principled, predictive, representation-level account of multilingual SAE steering.
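The layer-selection rule described in the abstract (picking an intervention depth where multilingual alignment and language separability are jointly high) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the per-layer scores, the elementwise-minimum "intersection," and the example values are all hypothetical assumptions.

```python
import numpy as np

def select_steering_layer(alignment, separability):
    """Pick the layer where multilingual alignment and language
    separability are jointly high, as an a-priori alternative to
    exhaustive layerwise search. One score per layer is assumed,
    each normalized to [0, 1]."""
    alignment = np.asarray(alignment, dtype=float)
    separability = np.asarray(separability, dtype=float)
    # Elementwise minimum acts as a soft intersection of the two
    # per-layer criteria; argmax then selects the best joint layer.
    joint = np.minimum(alignment, separability)
    return int(np.argmax(joint))

# Hypothetical per-layer scores for a 6-layer model:
alignment = [0.2, 0.5, 0.8, 0.9, 0.6, 0.3]
separability = [0.1, 0.4, 0.7, 0.6, 0.9, 0.8]
print(select_steering_layer(alignment, separability))  # prints 2
```

The elementwise minimum is one simple way to operationalize an "intersection" of two criteria; a product or thresholded overlap would serve the same illustrative purpose.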
Paper Type: Long
Research Area: Special Theme (conference specific)
Research Area Keywords: Mechanistic Interpretability, Language Models Interpretability, Analysis of Models for NLP
Contribution Types: Model analysis & interpretability
Languages Studied: English, Tibetan, Maltese, Italian, Spanish, German, Japanese, Arabic, Chinese (Simplified), Afrikaans, Dutch, French, Portuguese, Russian, Korean, Hindi, Turkish, Polish, Swedish, Danish, and Norwegian Bokmål
Submission Number: 8812