Don’t Think of the White Bear: Ironic Negation in Transformer Models under Cognitive Load

Published: 23 Sept 2025, Last Modified: 17 Feb 2026, Venue: CogInterp @ NeurIPS 2025 (Poster), License: CC BY 4.0
Keywords: ironic process theory, negation, cognitive load, large language models (LLMs), transformer architecture, AI safety, mechanistic interpretability, long-context reasoning, cognitive control, thought suppression, psycholinguistics, computational cognitive modeling, behavioral probes, in-context learning, attentional load, interference, representation analysis, AI alignment
TL;DR: LLMs exhibit human-like negation failure under cognitive load: stronger semantic representations are linked to poorer cognitive control, revealing a key trade-off between capability and controllability.
Abstract: Negation instructions such as 'do not mention $X$' can paradoxically increase the accessibility of $X$ in human thought, a phenomenon known as ironic rebound. Large language models (LLMs) face the same challenge: suppressing a concept requires internally activating it, which may prime rebound instead of avoidance. We investigate this tension with two experiments. (1) Load and content: after a negation instruction, we vary the distractor text (semantic, syntactic, or repetition) and measure rebound strength. (2) Polarity separation: we test whether models distinguish neutral from negative framings of the same concept and whether this separation predicts rebound persistence. Results show that rebound consistently arises immediately after negation and intensifies with longer or semantic distractors, whereas repetition supports suppression. Stronger polarity separation correlates with more persistent rebound. Together with a circuit-tracing analysis that identifies sparse middle-layer attention heads amplifying forbidden tokens while early layers suppress them, these findings link cognitive predictions of ironic rebound with mechanistic insights into long-context interference. To support future work, we release ReboundBench, a dataset of 5,000 systematically varied negation prompts designed to probe rebound in LLMs.
Submission Number: 112