Keywords: doubly stochastic matrices, spectral stability, deep residual networks, Birkhoff-von Neumann theorem, hyper-connections, token mixing, extreme depth training, quantization robustness
TL;DR: We propose BE-HC, which uses the Birkhoff-von Neumann theorem to construct exactly doubly stochastic mixing matrices as convex combinations of permutation matrices, enabling stable training at 1000+ layers where prior methods fail.
Abstract: Learnable information routing in deep networks faces the *depth-stability-efficiency trilemma*: architectures that scale to extreme depths often sacrifice efficiency, while efficient approaches lack stability guarantees. Prior work uses iterative Sinkhorn-Knopp normalization to approximate doubly stochastic mixing matrices, but residual errors destabilize training beyond several hundred layers. We propose **Birkhoff-Exact Hyper-Connections (BE-HC)**, which leverages the Birkhoff-von Neumann theorem to construct *exactly* doubly stochastic matrices as convex combinations of permutation matrices. This guarantees spectral radius $\rho = 1$ exactly rather than approximately, enabling stable training at unprecedented depths. **Key results:** (1) *Extreme depth:* BE-HC trains stably at **1000 layers**, achieving 35.71% accuracy where ReZero and other baselines fail to converge. (2) *Long context:* BE-HC handles **8K tokens** on a single V100 GPU (22.56% validation accuracy), while standard attention runs out of memory. (3) *Efficiency:* 1.47× throughput improvement over attention at 4K context length. (4) *Robustness:* 4× better accuracy retention under INT8 quantization than attention. BE-HC resolves the trilemma: exact stability enables depth, permutation structure enables efficiency, and bounded Lipschitz constants enable deployment robustness.
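The construction named in the abstract can be illustrated with a minimal NumPy sketch. It is not the submission's implementation: the softmax parameterization of the convex weights, the randomly drawn permutation set, and the names `birkhoff_mix`, `logits`, and `perms` are all assumptions made for illustration. The sketch only shows that a convex combination of permutation matrices has unit row and column sums by construction (up to floating-point rounding), with no iterative normalization.

```python
import numpy as np

def birkhoff_mix(logits, perms):
    """Hypothetical helper: convex combination of permutation matrices.

    logits: shape (k,), unconstrained parameters (assumed parameterization).
    perms:  shape (k, n), each row a permutation of range(n).
    """
    # Softmax turns the logits into convex weights (non-negative, sum to 1).
    w = np.exp(logits - logits.max())
    w /= w.sum()
    n = perms.shape[1]
    # Indexing the identity by a permutation yields its permutation matrix;
    # the result has shape (k, n, n).
    P = np.eye(n)[perms]
    # Weighted sum over the k permutation matrices gives an (n, n) matrix.
    return np.tensordot(w, P, axes=1)

rng = np.random.default_rng(0)
n, k = 4, 3  # illustrative sizes, not taken from the submission
perms = np.stack([rng.permutation(n) for _ in range(k)])
M = birkhoff_mix(rng.normal(size=k), perms)

row_sums = M.sum(axis=1)
col_sums = M.sum(axis=0)
spectral_radius = np.abs(np.linalg.eigvals(M)).max()
```

Because every permutation matrix is itself doubly stochastic and the weights are convex, `row_sums` and `col_sums` equal one and `spectral_radius` equals one up to rounding, which is the property the abstract contrasts with Sinkhorn-Knopp approximation.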
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Style Files: I have used the style files.
Challenge: This submission is an entry to the science of DL improvement challenge.
Submission Number: 34