Keywords: SAE, ClarifySAE, OutputScore, ReasonScore, LLM, clarification
Abstract: Instruction-tuned LLMs often respond to ambiguous instructions by guessing missing details rather than asking clarifying questions. Clarification-seeking improves reliability by aligning responses with user intent and avoiding assumptions about under-specified details. This is especially important for embodied AI, where misinterpretations can translate into task failure or safety risks. We propose ClarifySAE, an inference-time method that steers clarification-seeking by intervening on Sparse Autoencoder (SAE) features. ClarifySAE ranks SAE features using ClarifyScore, which measures association with clarification contexts, and filters them with OutputScore to retain features that measurably affect the model's output distribution. During decoding, we apply additive biases to the selected features, increasing the likelihood of generating a clarifying question without updating model weights. We evaluate our method on two datasets with ambiguous instructions (AmbiK and ClarQ-LLM) and two Gemma instruction-tuned models (2B and 9B) using pretrained 16k-feature SAEs. On AmbiK with Gemma-2-9B-IT, ClarifySAE increases the clarification rate from 0.61 to 0.95 and improves task success from 0.06 to 0.21. Our code will be publicly available.
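The steering step described above can be sketched in a few lines: encode a residual-stream activation with the SAE, add a constant bias to the selected features, decode back, and patch only the resulting delta into the residual stream. This is a minimal toy sketch, not the authors' released implementation; the weight matrices, sizes, selected feature indices, and bias magnitude here are all illustrative placeholders (the paper's SAEs have 16k features).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions; the pretrained SAEs in the paper use 16k features.
d_model, n_features = 8, 32
W_enc = rng.normal(size=(d_model, n_features))   # SAE encoder weights (placeholder)
W_dec = rng.normal(size=(n_features, d_model))   # SAE decoder weights (placeholder)

def steer(resid, selected, bias):
    """Add a bias to selected SAE features and return the steered residual.

    `selected` would be the features kept after ClarifyScore ranking and
    OutputScore filtering; here they are arbitrary indices.
    """
    acts = np.maximum(resid @ W_enc, 0.0)   # SAE encoding (ReLU activations)
    recon = acts @ W_dec                    # baseline reconstruction
    acts[..., selected] += bias             # additive steering bias
    steered = acts @ W_dec                  # reconstruction after steering
    # Patch only the steering delta into the residual stream, so the
    # SAE's reconstruction error is left untouched.
    return resid + (steered - recon)

resid = rng.normal(size=(d_model,))
out = steer(resid, selected=[3, 17], bias=4.0)
```

Because only the delta is added, the change to the residual stream is exactly the bias times the sum of the selected features' decoder rows; everything else passes through unchanged, which is why no weight update is needed.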
Paper Type: Long
Research Area: Dialogue and Interactive Systems
Research Area Keywords: evaluation and metrics, task-oriented, embodied agents, applications, multi-modal dialogue systems
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 8320