Exposing Attention Glitches with Flip-Flop Language Modeling

Published: 21 Sept 2023, Last Modified: 20 Jan 2024, NeurIPS 2023 spotlight
Keywords: Transformers, language models, hallucinations, long-range dependencies, generalization, extrapolation, out-of-distribution
TL;DR: Transformers fail to robustly keep track of a single bit of memory. The glitches are surprisingly subtle and persistent. We hypothesize that this accounts for some "closed-domain hallucinations".
Abstract: Why do large language models sometimes output factual inaccuracies and exhibit erroneous reasoning? The brittleness of these models, particularly when executing long chains of reasoning, currently seems to be an inevitable price to pay for their advanced capabilities of coherently synthesizing knowledge, pragmatics, and abstract thought. Towards making sense of this fundamentally unsolved problem, this work identifies and analyzes the phenomenon of _attention glitches_, in which the Transformer architecture's inductive biases intermittently fail to capture robust reasoning. To isolate the issue, we introduce _flip-flop language modeling_ (FFLM), a parametric family of synthetic benchmarks designed to probe the extrapolative behavior of neural language models. This simple generative task requires a model to copy binary symbols over long-range dependencies, ignoring the tokens in between. We find that Transformer FFLMs suffer from a long tail of sporadic reasoning errors, some of which we can eliminate using various regularization techniques. Our preliminary mechanistic analyses show why the remaining errors may be very difficult to diagnose and resolve. We hypothesize that attention glitches account for (some of) the closed-domain hallucinations in natural LLMs.
Supplementary Material: pdf
Submission Number: 4247
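
To make the task in the abstract concrete, here is a minimal sketch of a flip-flop sequence sampler. The abstract says only that the model must copy binary symbols across long-range dependencies while ignoring the tokens in between. The rest is an illustrative assumption, not the authors' released code: the three instruction names (`w` for write, `r` for read, `i` for ignore), the sampling probabilities, and the helper names `sample_flip_flop`, `read_error_rate`, and `last_write_baseline`.

```python
import random


def sample_flip_flop(length, p_write=0.1, p_read=0.1, seed=None):
    """Sample `length` (instruction, bit) pairs from a toy flip-flop language.

    Assumed conventions (hypothetical, for illustration only):
      "w" writes a fresh random bit into a one-bit memory,
      "r" must be followed by the bit currently held in memory,
      "i" carries a random distractor bit that must be ignored.
    The sequence starts with a write and ends with a read.
    """
    if length < 2:
        raise ValueError("length must be at least 2 (one write, one read)")
    rng = random.Random(seed)
    memory = rng.randint(0, 1)
    seq = [("w", memory)]
    for step in range(1, length):
        is_last = step == length - 1
        u = rng.random()
        if is_last or u < p_read:
            seq.append(("r", memory))           # deterministic: copy memory
        elif u < p_read + p_write:
            memory = rng.randint(0, 1)          # overwrite the stored bit
            seq.append(("w", memory))
        else:
            seq.append(("i", rng.randint(0, 1)))  # distractor
    return seq


def read_error_rate(seq, predict):
    """Fraction of read positions where predict(history) misses the true bit.

    `predict` receives every pair strictly before the read instruction.
    """
    reads = errors = 0
    for t, (instr, bit) in enumerate(seq):
        if instr == "r":
            reads += 1
            errors += predict(seq[:t]) != bit
    return errors / reads


def last_write_baseline(history):
    """Ideal one-bit memory: return the bit of the most recent write."""
    for instr, bit in reversed(history):
        if instr == "w":
            return bit
    raise ValueError("history contains no write")


seq = sample_flip_flop(length=512, p_write=0.05, p_read=0.05, seed=0)
baseline_error = read_error_rate(seq, last_write_baseline)
```

By construction the ideal baseline has zero read error. Substituting a trained model's prediction function for `last_write_baseline` and evaluating on sequences with sparser writes than those seen in training would be one way to look for the sporadic errors the abstract describes.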