On Rademacher Complexity-based Generalization Bounds for the Transformer Architecture

03 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Generalization Bounds, Rademacher Complexity, Transformers, Self-Attention, Deep Learning Theory
TL;DR: We derive the first formal Rademacher complexity-based generalization bound for Transformers, showing that their performance on unseen data is controlled by weight norms, depth, and sequence length.
Abstract: We derive the first end-to-end, data-dependent generalization bound for the Transformer architecture, a step toward explaining its strong empirical performance. Using Rademacher complexity and a novel Lipschitz analysis of self-attention, we construct a bound for deep, L-layer models. The bound shows that generalization capacity is governed by depth, sequence length, and a polynomial in the model's weight norms. A numerical sanity check validates the predicted scaling with model depth, providing a formal lens through which to understand and improve Transformers.
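For background, bounds of this type instantiate the classical data-dependent Rademacher complexity theorem. In the sketch below (standard background, not the paper's theorem), \widehat{\mathfrak{R}}_S denotes the empirical Rademacher complexity of the loss class \ell \circ \mathcal{F} on the sample S, and the loss is assumed to take values in [0,1]; with probability at least 1 - \delta over an i.i.d. sample of size n, every f \in \mathcal{F} satisfies

\[
\mathbb{E}\,\ell(f) \;\le\; \frac{1}{n}\sum_{i=1}^{n}\ell\bigl(f(x_i), y_i\bigr) \;+\; 2\,\widehat{\mathfrak{R}}_S(\ell \circ \mathcal{F}) \;+\; 3\sqrt{\frac{\log(2/\delta)}{2n}} .
\]

A bound of the kind the abstract describes would then follow by upper-bounding \widehat{\mathfrak{R}}_S for norm-constrained L-layer Transformers in terms of depth, sequence length, and weight norms.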
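To make the "numerical sanity check" idea concrete, here is a minimal, illustrative PyTorch sketch (not the paper's actual experiment) that Monte Carlo lower-bounds the empirical Rademacher complexity of a tiny norm-constrained Transformer encoder as depth varies: for random sign vectors sigma, it gradient-ascends the correlation (1/n) sum_i sigma_i f_theta(x_i) over weights projected onto fixed Frobenius-norm balls. All architecture and optimization choices (d_model=32, 4 heads, Adam, the projection radii) are assumptions for illustration.

import torch
import torch.nn as nn

def rademacher_lower_bound(depth, n=64, seq_len=16, d_model=32,
                           n_sigma=4, steps=100, lr=1e-2, seed=0):
    """Monte Carlo lower bound on the empirical Rademacher complexity
    sup_theta (1/n) sum_i sigma_i f_theta(x_i), with the sup approximated
    by gradient ascent over norm-constrained weights."""
    torch.manual_seed(seed)
    X = torch.randn(n, seq_len, d_model)  # fixed sample S of n sequences
    values = []
    for _ in range(n_sigma):
        sigma = torch.randint(0, 2, (n,)).float() * 2.0 - 1.0  # random +/-1 signs
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                           dim_feedforward=64, dropout=0.0,
                                           batch_first=True)
        encoder = nn.TransformerEncoder(layer, num_layers=depth)
        head = nn.Linear(d_model, 1)  # scalar readout f_theta(x)
        params = list(encoder.parameters()) + list(head.parameters())
        radii = [p.detach().norm().clamp(min=1e-8) for p in params]  # norm budget
        opt = torch.optim.Adam(params, lr=lr)
        for _ in range(steps):
            out = head(encoder(X).mean(dim=1)).squeeze(-1)
            corr = (sigma * out).mean()          # (1/n) sum_i sigma_i f(x_i)
            opt.zero_grad()
            (-corr).backward()                   # ascend the correlation
            opt.step()
            with torch.no_grad():                # project back onto the norm balls
                for p, r in zip(params, radii):
                    nrm = p.norm()
                    if nrm > r:
                        p.mul_(r / nrm)
        with torch.no_grad():
            out = head(encoder(X).mean(dim=1)).squeeze(-1)
            values.append((sigma * out).mean().item())
    return sum(values) / len(values)

for L in (1, 2, 4, 8):
    print(f"depth {L}: estimated R_hat >= {rademacher_lower_bound(L):.4f}")

Because the ascent is approximate, the printed values are lower bounds on the true sup; plotting them against depth is one crude way to compare an empirical depth trend with a theoretical prediction.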
Primary Area: learning theory
Submission Number: 1422