ClarifySAE: Steering Large Language Models Toward Clarification through Sparse Autoencoders

ACL ARR 2026 May Submission 16810 Authors

26 May 2026 (modified: 08 Jun 2026), ACL ARR 2026 May Submission, CC BY 4.0

Keywords: clarification questions, ambiguity resolution, instruction following, sparse autoencoders, mechanistic interpretability, embodied AI

Abstract: Instruction-tuned LLMs often resolve ambiguous instructions by guessing missing details rather than asking clarifying questions, which can cause task failures or safety risks for embodied agents. We propose ClarifySAE, an inference-time method that steers models toward clarification-seeking by intervening on Sparse Autoencoder (SAE) features. ClarifySAE ranks features with ClarifyScore, a clarification-specific adaptation of ReasonScore, and filters them with OutputScore to retain features that affect the output distribution. We evaluate on the AmbiK and ClarQ-LLM datasets using Gemma-2B, Gemma-9B, Llama-1B, and Llama-8B. On AmbiK, for Gemma-2-9B-IT, the clarification rate improves from 0.61 to 0.95 and task success from 0.06 to 0.21. On ClarQ-LLM, the best single-feature configurations improve success from 0.50 to 0.80 for Gemma-9B and from 0.40 to 0.90 for Llama-1B, with similar gains in step recall. Overall, ClarifySAE provides a lightweight, training-free mechanism for modulating clarification behavior through SAE features. The source code will be made publicly available.
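The abstract does not spell out how the intervention on SAE features is carried out. One common form of SAE steering, shown here only as an illustrative sketch and not as the authors' implementation, adds a scaled copy of a feature's decoder direction to a hidden-state vector at inference time. The function names, the toy vectors, and the `strength` parameter below are all hypothetical, and a ReLU encoder is assumed.

```python
from typing import List


def feature_activation(hidden: List[float], encoder_row: List[float], bias: float) -> float:
    """Activation of one SAE feature, assuming a ReLU encoder (an assumption)."""
    pre_activation = sum(h * w for h, w in zip(hidden, encoder_row)) + bias
    return max(0.0, pre_activation)


def steer_hidden_state(
    hidden: List[float], decoder_direction: List[float], strength: float
) -> List[float]:
    """Return the hidden state shifted along one feature's decoder direction."""
    if len(hidden) != len(decoder_direction):
        raise ValueError("hidden state and decoder direction must have the same size")
    return [h + strength * d for h, d in zip(hidden, decoder_direction)]


# Toy 3-dimensional example; real hidden states have thousands of dimensions.
hidden = [0.5, -1.0, 2.0]
direction = [1.0, 0.0, -1.0]  # hypothetical direction for a "clarify" feature

before = feature_activation(hidden, direction, bias=0.0)
steered = steer_hidden_state(hidden, direction, strength=2.0)
after = feature_activation(steered, direction, bias=0.0)
```

In this toy case the feature is inactive before steering and active afterwards, which is the kind of effect a steering method relies on. Which features to steer (the role of ClarifyScore and OutputScore in the paper) is not covered by this sketch.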

Paper Type: Long

Research Area: Dialogue and Interactive Systems

Research Area Keywords: task-oriented, embodied agents

Contribution Types: Model analysis & interpretability, Publicly available software and/or pre-trained models, Data analysis

Languages Studied: English

EMNLP 2026 AI Reviewing Experiment: no

Submission Number: 16810
