Keywords: Explainability, Actionability, Semantic, Steering, Attribution, Evaluation
Abstract: Understanding which parts of the input drive large language model (LLM) generation, and how to evaluate such explanations, remains challenging. Existing methods rely on token-level attributions that are difficult for humans to interpret and are assessed separately through model-centric faithfulness or human-centric plausibility. We introduce SemeX, an attribution-based explainability method that identifies semantically meaningful input words responsible for model behavior, while preserving grammatical coherence and remaining agnostic to output length. We validate SemeX through both model- and human-centric evaluations, showing that its explanations are faithful and align with human judgments. Building on this, we propose actionability as a unified evaluation criterion and quantify it via steering effectiveness, measuring whether explanations can meaningfully steer a model toward a desired output. Overall, our work reconciles model- and human-based evaluation by introducing a unified, actionability-driven framework for assessing explanation quality.
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: Interpretability and Analysis of Models for NLP, Human-Centered NLP, Language Modeling, Semantics: Lexical and Sentence-Level
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 1804