CAMEL: Counterfactuals As a Means for EvaLuating faithfulness of attribution methods in causal language models

ACL ARR 2024 June Submission 5414 Authors

16 Jun 2024 (modified: 14 Aug 2024) · CC BY 4.0
Abstract: Despite the widespread adoption of decoder-only autoregressive language models, explainability evaluation research has predominantly focused on encoder-only models, specifically masked language models (MLMs). Evaluating the faithfulness of an explanation method, i.e., how accurately the method explains the inner workings and decision-making of the model, is challenging because it is difficult to separate the model from its explanation. Most faithfulness evaluation techniques corrupt or remove input tokens considered important by a particular attribution (feature importance) method and observe the resulting change in the model's output. While these techniques are suitable for MLMs, which encounter corrupted or masked inputs during pretraining, they create out-of-distribution inputs for causal language models (CLMs) because of the fundamental difference in their training objective of next-token prediction. In this study, we propose a technique that leverages counterfactual generation to evaluate the faithfulness of attribution methods in autoregressive language modeling scenarios. Our technique creates natural, fluent, and in-distribution counterfactuals, a property we show to be important for a faithfulness evaluation method. We apply our method to several attribution methods and evaluate their faithfulness in identifying the important tokens of a few large language models.
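To make the critique concrete, below is a minimal sketch of the conventional perturbation-style faithfulness check that the abstract argues is ill-suited to CLMs (this is not the paper's CAMEL method). It assumes a HuggingFace causal LM; the model name, filler token, and the random "attribution scores" are illustrative placeholders standing in for a real attribution method.

```python
# Sketch of a perturbation-based faithfulness check for a causal LM.
# Replacing "important" tokens with a filler yields unnatural text,
# i.e. the out-of-distribution problem the abstract highlights.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def next_token_prob(input_ids: torch.Tensor, target_id: int) -> float:
    """Probability the model assigns to `target_id` as the next token."""
    with torch.no_grad():
        logits = model(input_ids).logits[0, -1]
    return torch.softmax(logits, dim=-1)[target_id].item()

text = "The capital of France is"
input_ids = tokenizer(text, return_tensors="pt")["input_ids"]

# Target: the model's own top next-token prediction on the clean prompt.
with torch.no_grad():
    target_id = model(input_ids).logits[0, -1].argmax().item()
p_orig = next_token_prob(input_ids, target_id)

# Hypothetical attribution scores, one per input token (in practice these
# would come from an attribution method such as Integrated Gradients).
scores = torch.rand(input_ids.shape[1])
top_k = scores.topk(2).indices  # tokens deemed most important

# Corrupt the "important" tokens with an arbitrary filler token.
filler_id = tokenizer.convert_tokens_to_ids(".")
perturbed = input_ids.clone()
perturbed[0, top_k] = filler_id
p_pert = next_token_prob(perturbed, target_id)

# A larger probability drop is read as evidence of a faithful attribution.
print(f"probability drop: {p_orig - p_pert:.4f}")
```

The paper's proposal replaces this kind of token corruption with generated counterfactuals that remain natural, fluent, and in-distribution for the CLM.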
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: interpretation, large-language-models
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 5414