Tight Bounds and Fundamental Impossibility for Knowledge Editing Side Effects in Transformers

11 Mar 2026 (modified: 13 Mar 2026) · Under review for TMLR · CC BY 4.0
Abstract: Knowledge editing enables targeted updates to factual associations in large language models without costly retraining, yet no formal guarantees exist for the unintended side effects these updates introduce, making certifiably safe deployment in high-stakes settings impossible. We close this gap with the first theoretical framework providing provably tight bounds (up to a computable constant $C_\Phi$) on knowledge editing side effects in transformers. Our central theorem establishes tight, computable bounds on how rank-$r$ weight perturbations propagate to unrelated inputs, with all constants made explicit via a non-circular algorithm that avoids the dependency cycles afflicting prior analyses. We further derive edit capacity bounds that predict when sequential edits trigger catastrophic degradation, and prove a fundamental impossibility result: perfect locality and generalization are mutually exclusive under representational superposition, characterizing an inherent Pareto frontier rather than a fixable algorithmic limitation. Experiments across 21,600 edits on GPT-2 and GPT-J, with additional cross-architecture validation on OPT, BLOOM, and LLaMA (Appendix), confirm all theoretical predictions (Spearman $\rho = 0.82$, $p < 10^{-50}$), with the impossibility frontier matching measurements within 3%. Applied as a pre-deployment safety screen on GPT-2 with ROME, our bounds raise locality from 67.1% to 92.3%, demonstrating immediate practical value.
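The abstract's core object, a rank-$r$ weight perturbation whose effect on unrelated inputs is bounded, can be illustrated with elementary linear algebra. The sketch below is not the paper's bound (the constant $C_\Phi$ and the transformer-specific propagation analysis are not reproduced here); it only shows the basic mechanism, for a single hypothetical linear layer, that the worst-case output shift on any input is controlled by the spectral norm of the edit. All names and dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 2  # hypothetical hidden size and edit rank

# A weight matrix and a rank-r edit Delta_W = U @ V.T (as in ROME-style updates)
W = rng.standard_normal((d, d)) / np.sqrt(d)
U = rng.standard_normal((d, r))
V = rng.standard_normal((d, r))
Delta_W = U @ V.T
assert np.linalg.matrix_rank(Delta_W) == r

# The spectral norm of the perturbation bounds the per-layer output shift:
# ||(W + Delta_W) x - W x|| = ||Delta_W x|| <= ||Delta_W||_2 * ||x||
spec = np.linalg.norm(Delta_W, 2)
x = rng.standard_normal(d)  # an "unrelated" input
shift = np.linalg.norm((W + Delta_W) @ x - W @ x)
assert shift <= spec * np.linalg.norm(x) + 1e-9
```

In the paper's setting this single-layer inequality is composed through the transformer's nonlinearities, which is where the explicit constants and the non-circular estimation algorithm come in.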
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Di_Wang1
Submission Number: 7889