In Praise of Stubbornness: An Empirical Case for Cognitive-Dissonance Aware Continual Update of Knowledge in LLMs
Keywords: Large Language Models; continual learning; knowledge editing; counterfactual updates; catastrophic interference; catastrophic forgetting; selective plasticity; neuron targeting; sparsity; conflict detection; robustness to contradictions; model safety
TL;DR: LLMs can learn non-contradictory facts safely, but updating them with counterfactuals corrupts totally unrelated knowledge. Targeting “plastic” vs. “stubborn” neurons helps only for non-contradictory updates; counterfactual harm persists.
Abstract: Through systematic empirical investigation, we uncover a fundamental property of large language models (LLMs) with implications for continual learning: they can safely learn facts that do not contradict existing knowledge, but attempts to update them with counterfactuals cause catastrophic corruption of *unrelated* knowledge. Unlike humans, who naturally resist conflicting information, LLMs have no such safeguards by design. This leads to severe interference, destroying up to 80% of unrelated factual knowledge after as few as 10–100 counterfactual updates. To test whether selective plasticity can mitigate this damage, we perform targeted updates, distinguishing between previously used (*stubborn*) and rarely used (*plastic*) neurons. We again find an asymmetry: sparing frequently used neurons improves retention for non-contradictory updates (98% vs. 93% retention under standard updates), yet counterfactual updates trigger catastrophic interference regardless of targeting. This effect, which persists across tested models and scales (from GPT-2 to GPT-J-6B, as well as the GPT-4.1 family), suggests a general property of current LLMs. Finally, we show that counterfactual inputs can be detected with ≥95% accuracy using simple model features, pointing to a practical safeguard. These findings motivate research on architectures that, like humans, naturally resist contradictions rather than allowing destructive overwrites.
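The selective-plasticity idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes per-neuron activation counts are tracked during prior training, freezes the top fraction of most-used ("stubborn") neurons, and lets gradient updates flow only to rarely used ("plastic") neurons. All names (`usage_mask`, `masked_update`, `spare_fraction`) are hypothetical.

```python
# Sketch of usage-based gradient masking for selective plasticity.
# Assumption: `counts[i]` is a proxy for how often neuron i was used
# when the model learned its existing knowledge.

def usage_mask(counts, spare_fraction=0.2):
    """Return a per-neuron mask: 0.0 freezes a stubborn neuron, 1.0 keeps it plastic."""
    k = int(len(counts) * spare_fraction)
    order = sorted(range(len(counts)), key=lambda i: counts[i])
    stubborn = set(order[-k:]) if k else set()
    return [0.0 if i in stubborn else 1.0 for i in range(len(counts))]

def masked_update(weights, grads, mask, lr=0.1):
    """Gradient step applied row-wise; frozen rows receive no update."""
    return [[w - lr * g * m for w, g in zip(w_row, g_row)]
            for w_row, g_row, m in zip(weights, grads, mask)]

counts = [120, 3, 85, 1, 40]                  # hypothetical per-neuron usage
mask = usage_mask(counts, spare_fraction=0.4)  # freezes the two most-used neurons (0 and 2)
W = [[0.0] * 3 for _ in range(5)]
G = [[1.0] * 3 for _ in range(5)]
W_new = masked_update(W, G, mask)              # rows 0 and 2 stay unchanged
```

Under the paper's finding, a mask like this helps retention for non-contradictory updates but does not prevent the interference caused by counterfactual ones.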
Primary Area: foundation or frontier models, including LLMs
Submission Number: 24308