Keywords: large language model, faithfulness, safety, explainability
TL;DR: We introduce a novel method for measuring the faithfulness of explanations given by LLMs.
Abstract: Large language models (LLMs) are capable of producing plausible explanations of how they arrived at an answer to a question. However, these explanations can be unfaithful to the model's true underlying behavior, potentially leading to over-trust and misuse. We introduce a new approach for measuring the faithfulness of explanations provided by LLMs. Our first contribution is to translate an intuitive understanding of what it means for an LLM explanation to be faithful into a formal definition of faithfulness. Since LLM explanations mimic human explanations, they often reference high-level *concepts* in the input question that are influential in decision-making. We formalize faithfulness in terms of the difference between the set of concepts that the LLM *says* are influential and the set that *truly* are. We then present a method for quantifying faithfulness that is based on: (1) using an auxiliary LLM to edit, or perturb, the values of concepts within model inputs, and (2) using a hierarchical Bayesian model to quantify how changes to concepts affect model answers at both the example and dataset levels. Through preliminary experiments on a question-answering dataset, we show that our method can be used to quantify and discover interpretable patterns of unfaithfulness, including cases where LLMs fail to admit their use of social biases.
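For illustration, the sketch below shows the example-level comparison implied by this definition: perturb a concept's value with an auxiliary LLM, re-query the target LLM, and compare the concepts the model *says* it used against those whose perturbation actually changes its answer. All function names (`answer_fn`, `edit_fn`, the influence threshold) are hypothetical placeholders rather than the authors' implementation, and the hierarchical Bayesian aggregation over the dataset is not shown.

```python
# Minimal sketch of the perturbation-based faithfulness check described in the
# abstract. Callables stand in for the target and auxiliary LLMs; this is an
# assumed interface, not the paper's actual code.
from typing import Callable, Dict, List, Set


def concept_effect(
    question: str,
    concept: str,
    new_values: List[str],
    answer_fn: Callable[[str], str],          # target LLM: question -> answer
    edit_fn: Callable[[str, str, str], str],  # auxiliary LLM: (question, concept, value) -> edited question
) -> float:
    """Fraction of concept perturbations that change the target LLM's answer."""
    original = answer_fn(question)
    flips = 0
    for value in new_values:
        perturbed = edit_fn(question, concept, value)
        if answer_fn(perturbed) != original:
            flips += 1
    return flips / len(new_values) if new_values else 0.0


def unfaithfulness_gap(
    stated_concepts: Set[str],      # concepts the LLM's explanation claims to rely on
    effects: Dict[str, float],      # measured effect per concept (e.g., from concept_effect)
    threshold: float = 0.5,         # hypothetical cutoff for calling a concept "truly influential"
) -> Dict[str, Set[str]]:
    """Compare stated vs. measured influence, per the abstract's definition of faithfulness."""
    influential = {c for c, e in effects.items() if e >= threshold}
    return {
        "unstated_but_influential": influential - stated_concepts,
        "stated_but_not_influential": stated_concepts - influential,
    }
```

A concept that appears in `unstated_but_influential` (for instance, a demographic attribute the model never mentions in its explanation but whose perturbation flips the answer) would correspond to the kind of unadmitted social-bias case the abstract describes.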
Submission Number: 95