Keywords: clarification questions, ambiguity resolution, instruction following, sparse autoencoders, mechanistic interpretability, embodied AI
Abstract: Instruction-tuned LLMs often resolve ambiguous instructions by guessing missing details rather than asking clarifying questions, which can lead to task failure or safety risks for embodied agents. We propose ClarifySAE, an inference-time method that steers clarification-seeking behavior by intervening on Sparse Autoencoder (SAE) features. ClarifySAE ranks features with ClarifyScore, a clarification-specific adaptation of ReasonScore, and filters them with OutputScore to retain only features that affect the output distribution. We evaluate on the AmbiK and ClarQ-LLM datasets using Gemma-2B, Gemma-9B, Llama-1B, and Llama-8B. On AmbiK with Gemma-2-9B-IT, the clarification rate improves from 0.61 to 0.95 and task success from 0.06 to 0.21. On ClarQ-LLM, the best single-feature configurations improve success from 0.50 to 0.80 for Gemma-9B and from 0.40 to 0.90 for Llama-1B, with similar gains in step recall. Overall, ClarifySAE provides a lightweight, training-free mechanism for modulating clarification behavior through SAE features. The source code will be made publicly available.
Paper Type: Long
Research Area: Dialogue and Interactive Systems
Research Area Keywords: task-oriented, embodied agents
Contribution Types: Model analysis & interpretability, Publicly available software and/or pre-trained models, Data analysis
Languages Studied: English
EMNLP 2026 AI Reviewing Experiment: no
Submission Number: 16810