Explaining black box text modules in natural language with language models

Published: 27 Oct 2023, Last Modified: 06 Nov 2023 · NeurIPS XAIA 2023
Abstract: Large language models (LLMs) have demonstrated remarkable prediction performance for a growing array of tasks. However, their rapid proliferation and increasing opacity have created a growing need for interpretability. Here, we ask whether we can automatically obtain natural language explanations for black box text modules. A *text module* is any function that maps text to a scalar continuous value, such as a submodule within an LLM or a fitted model of a brain region. *Black box* indicates that we only have access to the module's inputs and outputs. We introduce Summarize and Score (SASC), a method that takes in a text module and returns a natural language explanation of the module's selectivity, along with a score for how reliable the explanation is. We study SASC in two contexts. First, we evaluate SASC on synthetic modules and find that it often recovers ground truth explanations. Second, we use SASC to explain modules found within a pre-trained BERT model, enabling inspection of the model's internals.
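The two-step idea in the abstract (summarize a module's selectivity, then score the explanation) can be illustrated with a toy sketch. This is not the paper's implementation: the paper uses LLMs to summarize top-activating ngrams and to generate synthetic evaluation texts, whereas here `toy_module`, the crude word-frequency summarizer, and the hand-supplied related/unrelated texts are all hypothetical stand-ins.

```python
def toy_module(text: str) -> float:
    """A stand-in black-box text module: maps text to a scalar
    (here, how many food-related words it contains)."""
    food_words = {"pizza", "pasta", "bread", "cheese"}
    return float(sum(w in food_words for w in text.lower().split()))

def summarize(module, corpus, top_k=2):
    """Summarize step (toy version): keep the top-k texts that most
    activate the module and return the most frequent word among them
    as a crude candidate explanation."""
    top = sorted(corpus, key=module, reverse=True)[:top_k]
    counts = {}
    for text in top:
        for w in text.lower().split():
            counts[w] = counts.get(w, 0) + 1
    return max(counts, key=counts.get)

def score(module, related, unrelated):
    """Score step (toy version): an explanation is reliable if texts
    related to it drive the module's output higher than unrelated ones;
    report the difference of mean outputs."""
    mean = lambda texts: sum(map(module, texts)) / len(texts)
    return mean(related) - mean(unrelated)

corpus = [
    "fresh pizza and cheese",
    "pizza with bread",
    "the sky is blue",
    "stocks fell today",
]
explanation = summarize(toy_module, corpus)   # → "pizza"
reliability = score(
    toy_module,
    related=["cheese pizza", "pasta and bread"],   # texts matching the explanation's topic
    unrelated=["blue sky", "rainy day"],
)
```

A positive `reliability` score indicates the candidate explanation genuinely predicts when the module fires; in the paper, the related texts are instead generated by an LLM from the explanation itself.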
Submission Track: Full Paper Track
Application Domain: Natural Language Processing
Survey Question 1: Large language models, such as ChatGPT, consist of a large number of modules, which are difficult to interpret efficiently. We propose a method, called SASC, that helps to automatically explain the function of a module with a short natural-language description.
Survey Question 2: Existing approaches for automatically explaining text modules are limited, often requiring a great deal of human effort in sifting through text inputs and module outputs to guess what a module is doing. Our approach helps to automate this process.
Survey Question 3: We use large language models themselves to generate and evaluate explanations.
Submission Number: 16