Overthinking the Truth: Understanding how Language Models process False Demonstrations

Published: 01 Feb 2023, 19:30
Last Modified: 13 Feb 2023, 23:27
Submitted to ICLR 2023
Readers: Everyone
Keywords: Large Language Models, Interpretability, Safety, Mechanistic Interpretability, Science of ML
Abstract: Through few-shot learning or chain-of-thought prompting, modern language models can detect and imitate complex patterns in their prompt. This behavior allows language models to complete challenging tasks without fine-tuning, but it can be at odds with completion quality: if the context is inaccurate or harmful, the model may reproduce these defects in its completions. In this work, we show that this harmful context-following appears late in a model's computation: in particular, given an inaccurate context, models perform better after zeroing out later layers. More concretely, at early layers models have similar performance given either accurate or inaccurate few-shot prompts, but a gap appears at later layers (e.g. layers 10-14 for GPT-J). This gap appears at a consistent depth across datasets and coincides with the appearance of "induction heads" that attend to previous answers in the prompt. We restore performance for inaccurate contexts by ablating a subset of these heads, reducing the gap by 28% on average across 8 datasets. Our results suggest that studying early stages of computation could be a promising strategy for preventing misleading outputs, and that understanding and editing internal mechanisms can help correct unwanted model behavior.
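The abstract's "zeroing out later layers" can be illustrated with a minimal toy sketch (this is not the paper's implementation, and GPT-J is not used here): in a residual-stream transformer, early-exit decoding means applying the unembedding readout to an intermediate residual state instead of the final one. All names, dimensions, and the toy layer update below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

D, V, L = 16, 10, 12           # hidden size, vocab size, number of layers (toy values)
W_U = rng.normal(size=(D, V))  # unembedding (readout) matrix

# Toy "layers": each layer writes an update into the residual stream.
layer_weights = [rng.normal(scale=0.1, size=(D, D)) for _ in range(L)]

def forward(x, stop_after=None):
    """Run the residual stream, optionally zeroing out every layer
    after `stop_after` (early exit). Returns logits from the readout."""
    h = x.copy()
    n = L if stop_after is None else stop_after
    for W in layer_weights[:n]:
        h = h + np.tanh(h @ W)  # residual update from this layer
    return h @ W_U              # decode the (possibly intermediate) residual

x = rng.normal(size=(D,))
full_logits = forward(x)                 # all L layers contribute
early_logits = forward(x, stop_after=8)  # layers 9..L zeroed out
```

Comparing `full_logits` and `early_logits` across accurate and inaccurate prompts is the shape of the measurement described above; in a real model one would read intermediate hidden states (e.g. via forward hooks) rather than truncating a toy loop.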
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Deep Learning and representational learning