Error dynamics of symbolic context in small transformers

ICLR 2026 Conference Submission 21968 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: Error propagation, Mechanistic interpretability, Arithmetic sequences, Small transformers
Abstract: Language models often recover from partial corruption of their inputs, yet the mechanism behind this spontaneous context restoration is unclear. We study controlled, label-preserving corruptions in symbolic arithmetic and find a consistent mid-to-late-layer elbow at which later components integrate surviving cues to reconstruct the answer. We introduce two readouts: Repair Difference (RD), a logit-space contribution measure, and Token Agreement (TA), a layer-wise consistency score, together with a linearity-scale test that predicts repairability. We find near-linear behavior on clean inputs and pronounced nonlinearity under corruption; the linearity residual predicts repair success. Across model families, accuracy degrades smoothly with corruption ($\rho \approx -1$), yielding compact robustness summaries ($\tau_{50} \approx 27$--$34\%$). RD and TA peak near the elbow, localizing where repair occurs. Brief fine-tuning at moderate corruption levels improves self-repair, whereas training on heavily corrupted data weakens it, giving a simple, data-efficient training recipe. To test the linearity claim beyond arithmetic, we replicate the correlation between context perturbation and local nonlinearity on an NLP corruption task. Together, RD, TA, $\tau_{50}$, and the linearity test form a concise toolkit for diagnosing and training spontaneous context restoration, offering actionable guidance on when and how models repair corrupted context and practical levers for debugging, evaluation, and training.
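
To make the robustness summary concrete, the following is a minimal sketch (not the authors' code) of how a $\tau_{50}$ value could be estimated from an accuracy-versus-corruption curve, under the assumption that $\tau_{50}$ denotes the corruption level at which accuracy falls to half its clean value; the function name and the illustrative numbers are hypothetical.

```python
# Minimal sketch: estimate a tau_50 robustness summary from an
# accuracy-vs-corruption curve. Assumes tau_50 is the corruption rate at
# which accuracy first falls to half its clean (0% corruption) value;
# the paper's exact definition may differ.
import numpy as np

def tau_50(corruption_rates, accuracies):
    """Linearly interpolate the corruption rate where accuracy crosses
    half of the clean accuracy; returns NaN if it never does."""
    corruption_rates = np.asarray(corruption_rates, dtype=float)
    accuracies = np.asarray(accuracies, dtype=float)
    threshold = 0.5 * accuracies[0]  # half of clean accuracy
    for i in range(1, len(accuracies)):
        if accuracies[i] <= threshold:
            # Interpolate between the two bracketing measurements.
            x0, x1 = corruption_rates[i - 1], corruption_rates[i]
            y0, y1 = accuracies[i - 1], accuracies[i]
            return x0 + (threshold - y0) * (x1 - x0) / (y1 - y0)
    return float("nan")

# Illustrative smooth degradation curve (hypothetical numbers),
# giving tau_50 in the high-20s-percent range.
rates = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
accs = [0.98, 0.90, 0.72, 0.45, 0.20, 0.05]
print(f"tau_50 ≈ {tau_50(rates, accs):.1%}")
```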
Primary Area: interpretability and explainable AI
Submission Number: 21968