Keywords: Explainability, Actionability, Semantic, Steering, Attribution, Evaluation
Abstract: Understanding which parts of the input drive large language model (LLM) generation, and how to evaluate such explanations, remains challenging. Existing methods rely on token-level attributions that are difficult for humans to interpret and are assessed separately through model-centric faithfulness or human-centric plausibility. We introduce SemeX, an attribution-based explainability method that identifies semantically meaningful input words responsible for model behavior, while preserving grammatical coherence and remaining agnostic to output length. We validate SemeX through both model- and human-centric evaluations, showing that its explanations are faithful and align with human judgments. Building on this, we propose actionability as a unified evaluation criterion and quantify it via steering effectiveness, measuring whether explanations can meaningfully steer a model toward a desired output. Overall, our work reconciles model- and human-based evaluation by introducing a unified, actionability-driven framework for assessing explanation quality.
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: Interpretability and Analysis of Models for NLP, Human-Centered NLP, Language Modeling, Semantics: Lexical and Sentence-Level
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 1804