Residual Connections Relay Generalization but Not Memorization in Transformers

ICLR 2026 Conference Submission 221 Authors

01 Sept 2025 (modified: 08 Oct 2025) · CC BY 4.0
Keywords: Residual connections, Transformers, Memorization, Generalization
Abstract: Residual connections are a core component of transformers: they stabilize training and improve optimization, yet it remains unclear how they influence memorization, a behavior transformers are known to exhibit, especially in overparameterized regimes. In this work, we therefore investigate the impact of residual connections on memorization in transformers. Our analysis shows that residual connections do not drive memorization; instead, their removal primarily impairs learning, a novel finding. Furthermore, we find that residual connections in early layers are significantly more important for performance than those in later layers. To explain these findings, we perform a gradient-flow and output-margin analysis, demonstrating how residual connections support learning dynamics without propagating memorization.
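To make the removal ablation concrete, the sketch below is a minimal, illustrative pre-norm transformer block in PyTorch with a switchable skip connection. The `use_residual` flag, the block structure, and all hyperparameters are our assumptions for illustration, not the authors' implementation. The gradient-flow intuition is standard: with a residual, y = x + f(x) gives dy/dx = I + df/dx, so gradients reach earlier layers through the identity term, which is the kind of effect the abstract's gradient-flow analysis examines.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Minimal pre-norm transformer block with switchable residual connections.

    `use_residual` is a hypothetical flag for the removal ablation described
    in the abstract; it is not taken from the paper's code.
    """
    def __init__(self, d_model: int, n_heads: int, use_residual: bool = True):
        super().__init__()
        self.use_residual = use_residual
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Attention sub-layer: keep or drop the skip connection.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out if self.use_residual else attn_out
        # MLP sub-layer: same switch.
        mlp_out = self.mlp(self.norm2(x))
        return x + mlp_out if self.use_residual else mlp_out
```

A per-layer version of this switch (e.g., disabling residuals only in early versus late blocks) would correspond to the layer-wise importance comparison the abstract reports.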
Primary Area: interpretability and explainable AI
Submission Number: 221