Walk the Talk? Measuring the Faithfulness of Large Language Model Explanations

Published: 05 Mar 2024, Last Modified: 08 May 2024 · ICLR 2024 R2-FM Workshop Poster · CC BY 4.0
Keywords: large language model, faithfulness, safety, explainability
TL;DR: We introduce a novel method for measuring the faithfulness of explanations given by LLMs.
Abstract: Large language models (LLMs) are capable of producing compelling explanations of their reasoning when answering questions. However, LLM explanations can be unfaithful to the model's true underlying behavior, potentially leading to over-trust and misuse. In this work, we introduce a new approach for measuring explanation faithfulness that is tailored to LLMs. Our first contribution is to translate an intuitive understanding of what it means for an LLM explanation to be faithful into a formal definition of faithfulness. Since LLM explanations mimic human explanations, they often reference high-level *concepts* in the input question that are influential in decision-making. We formalize faithfulness in terms of the difference between the set of concepts that the LLM *says* are influential and the set that *truly* are. We then present a novel method for quantifying faithfulness that is based on: (1) using an auxiliary LLM to edit, or perturb, the values of concepts within model inputs, and (2) using a hierarchical Bayesian model to quantify how changes to concepts affect model answers at both the example- and dataset-level. Through preliminary experiments on a question-answering dataset, we show that our method can be used to quantify and discover interpretable patterns of unfaithfulness, including cases where LLMs fail to admit their use of social biases.
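To make the idea concrete, below is a minimal, hypothetical sketch of the kind of concept-perturbation check the abstract describes: edit one concept at a time in the input, re-query the model, and compare the concepts whose edits actually change the answer against the concepts the model's explanation claims were influential. All names here (`ask_model`, `perturb_concept`, the flip-rate statistic, the threshold) are illustrative placeholders, not the paper's method; in particular, the paper aggregates perturbation effects with a hierarchical Bayesian model rather than the simple flip rates used in this sketch.

```python
# Illustrative sketch only; function arguments are caller-supplied placeholders.
from typing import Callable, Dict, List


def estimate_influential_concepts(
    question: str,
    concepts: List[str],
    ask_model: Callable[[str], str],             # returns the model's answer to a question
    perturb_concept: Callable[[str, str], str],  # rewrites the question with one concept's value changed
    n_perturbations: int = 5,
) -> Dict[str, float]:
    """For each concept, estimate how often editing its value flips the model's answer."""
    original_answer = ask_model(question)
    flip_rates: Dict[str, float] = {}
    for concept in concepts:
        flips = 0
        for _ in range(n_perturbations):
            edited_question = perturb_concept(question, concept)
            if ask_model(edited_question) != original_answer:
                flips += 1
        flip_rates[concept] = flips / n_perturbations
    return flip_rates


def unfaithfulness_gap(
    stated_concepts: List[str],
    flip_rates: Dict[str, float],
    flip_threshold: float = 0.5,
) -> Dict[str, List[str]]:
    """Compare the concepts the model *says* are influential with those that empirically are."""
    truly_influential = {c for c, rate in flip_rates.items() if rate >= flip_threshold}
    return {
        "claimed_but_not_influential": sorted(set(stated_concepts) - truly_influential),
        "influential_but_not_claimed": sorted(truly_influential - set(stated_concepts)),
    }
```

In this toy formulation, the two returned sets correspond to the two directions of unfaithfulness the abstract alludes to: concepts the explanation cites that do not actually move the answer, and concepts (e.g., social attributes) that move the answer but go unmentioned.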
Submission Number: 79