Keywords: example-based explanations, training data influence estimation, training data attribution, interpretability
Abstract: Training data influence estimation methods quantify the contribution of training documents to a model’s output, making them a promising source of information for example-based explanations.
As humans cannot interpret thousands of documents, only a small subset of the training data can be presented as an explanation.
Although the choice of which documents to include directly affects explanation quality, previous evaluations of such systems have largely ignored selection strategies.
To address this, we propose a novel *selection relevance score*, a retraining-free metric that quantifies how useful a set of examples is for explaining a model’s output.
We validate this score through fine-tuning experiments, confirming that it can predict whether a set of examples supports or undermines the model’s predictions.
Using this metric, we further show that common selection strategies often underperform random selection.
Motivated by this finding, we propose a strategy that balances influence and representativeness, making better use of a fixed selection budget than naively choosing the highest-ranked examples.
Paper Type: Long
Research Area: Special Theme (conference specific)
Research Area Keywords: ACL 2026 Special Theme: Explainability of NLP Models
Contribution Types: Model analysis & interpretability, Publicly available software and/or pre-trained models
Languages Studied: en
Submission Number: 3303