In Praise of Stubbornness: An Empirical Case for Cognitive-Dissonance Aware Continual Update of Knowledge in LLMs
Keywords: Large Language Models; continual learning; knowledge editing; counterfactual updates; catastrophic interference; catastrophic forgetting; selective plasticity; neuron targeting; sparsity; conflict detection; robustness to contradictions; model safety
TL;DR: LLMs can learn non-contradictory facts safely, but updating them with counterfactuals corrupts totally unrelated knowledge. Targeting “plastic” vs. “stubborn” neurons helps only for non-contradictory updates; counterfactual harm persists.
Abstract: Through systematic empirical investigation, we uncover a fundamental property of large language models (LLMs) with implications for continual learning: they can safely learn facts that do not contradict existing knowledge, but attempts to update them with counterfactuals cause catastrophic corruption of *unrelated* knowledge. Unlike humans, who naturally resist conflicting information, LLMs have no such safeguards by design. This leads to severe interference, destroying up to 80% of unrelated factual knowledge after as few as 10–100 counterfactual updates. To test whether selective plasticity can mitigate this damage, we perform targeted updates, distinguishing between previously used (*stubborn*) and rarely used (*plastic*) neurons. We again find an asymmetry: sparing frequently used neurons improves retention for non-contradictory updates (98% vs. 93% retention under standard updates), yet counterfactual updates trigger catastrophic interference regardless of targeting. This effect, which persists across tested models and scales (from GPT-2 to GPT-J-6B, as well as the GPT-4.1 family), suggests a general property of current LLMs. Finally, we show that counterfactual inputs can be detected with ≥95% accuracy using simple model features, pointing to a practical safeguard. These findings motivate research on architectures that, like humans, naturally resist contradictions rather than allowing destructive overwrites.
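The selective-plasticity idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes per-neuron activation counts are tracked during prior training, freezes the top fraction of most-used ("stubborn") neurons, and lets gradient updates flow only to rarely used ("plastic") neurons. All names (`usage_mask`, `masked_update`, `spare_fraction`) are hypothetical.

```python
# Sketch of usage-based gradient masking for selective plasticity.
# Assumption: `counts[i]` is a proxy for how often neuron i was used
# when the model learned its existing knowledge.

def usage_mask(counts, spare_fraction=0.2):
    """Return a per-neuron mask: 0.0 freezes a stubborn neuron, 1.0 keeps it plastic."""
    k = int(len(counts) * spare_fraction)
    order = sorted(range(len(counts)), key=lambda i: counts[i])
    stubborn = set(order[-k:]) if k else set()
    return [0.0 if i in stubborn else 1.0 for i in range(len(counts))]

def masked_update(weights, grads, mask, lr=0.1):
    """Gradient step applied row-wise; frozen rows receive no update."""
    return [[w - lr * g * m for w, g in zip(w_row, g_row)]
            for w_row, g_row, m in zip(weights, grads, mask)]

counts = [120, 3, 85, 1, 40]                  # hypothetical per-neuron usage
mask = usage_mask(counts, spare_fraction=0.4)  # freezes the two most-used neurons (0 and 2)
W = [[0.0] * 3 for _ in range(5)]
G = [[1.0] * 3 for _ in range(5)]
W_new = masked_update(W, G, mask)              # rows 0 and 2 stay unchanged
```

Under the paper's finding, a mask like this helps retention for non-contradictory updates but does not prevent the interference caused by counterfactual ones.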
Primary Area: foundation or frontier models, including LLMs
Submission Number: 24308