Steering Large Language Models Toward Clarification through Sparse Autoencoders
Keywords: LLM, SAE, clarification, Embodied AI, interpretability
TL;DR: ClarifySAE steers instruction-tuned LLMs toward clarification at inference time by amplifying clarification-linked SAE features (via ClarifyScore and OutputScore), without retraining.
Abstract: Instruction-tuned LLMs often respond to ambiguous instructions by guessing missing details rather than asking clarifying questions. Clarification-seeking improves reliability by aligning responses with user intent and avoiding assumptions about under-specified details. This is especially important for embodied AI, where misinterpretations can translate into task failure or safety risks. We propose \textbf{ClarifySAE}, an inference-time method that steers clarification-seeking by intervening on Sparse Autoencoder (SAE) features. ClarifySAE ranks SAE features using ClarifyScore, which measures association with clarification contexts, and filters them with OutputScore to retain features that measurably affect the model's output distribution. During decoding, we apply additive biases to the selected features, increasing the likelihood of generating a clarifying question without updating model weights. We evaluate our method on two datasets with ambiguous instructions (AmbiK and ClarQ-LLM) and two Gemma instruction-tuned models (2B and 9B) using pretrained 16k-feature SAEs. On AmbiK with Gemma-2-9B-IT, ClarifySAE increases clarification rate from 0.61 to 0.95 and improves task success from 0.06 to 0.21.
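The additive-bias intervention described in the abstract can be illustrated with a toy sketch. The code below is a minimal, hypothetical example (not the authors' implementation): it assumes a set of pre-selected SAE feature indices and adds a fixed bias along each feature's decoder direction in the residual stream, the common form of inference-time SAE steering. The dimensions, `W_dec`, `steer`, and `alpha` are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: model hidden width and SAE dictionary size.
D_MODEL, N_FEATURES = 16, 64

# Toy SAE decoder matrix: each row is one feature's direction
# in the model's residual space, normalized to unit length.
W_dec = rng.standard_normal((N_FEATURES, D_MODEL))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)

def steer(hidden, feature_ids, alpha=4.0):
    """Add a fixed bias alpha along each selected feature's
    decoder direction, leaving model weights untouched."""
    out = hidden.copy()
    for i in feature_ids:
        out = out + alpha * W_dec[i]
    return out

# Steer a single residual-stream activation toward two
# (hypothetical) clarification-linked features.
h = rng.standard_normal(D_MODEL)
h_steered = steer(h, feature_ids=[3, 17], alpha=4.0)

# The intervention is purely additive along the chosen directions.
print(np.allclose(h_steered - h, 4.0 * (W_dec[3] + W_dec[17])))
```

In practice such a hook would run at every decoding step on the residual stream of the layer the SAE was trained on, with `alpha` tuned so that clarification rate rises without degrading fluency.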
Submission Number: 58