Keywords: Transformers, Representation Learning, Gradient Dynamics, Pruning, Architectural Efficiency, Causal Inference
TL;DR: Deep Transformer layer redundancy is caused by a structural information bottleneck in the gradient's path. Causal interventions validate this mechanism, which we then leverage to build a superior pruning method and a more efficient architecture.
Abstract: Deep Transformers are composed of uniformly stacked residual blocks, yet their deepest layers often add little value. Prevailing explanations attribute this to small gradients, treating a symptom rather than the cause. We identify Gradient Fan-in Asymmetry as the structural driver of redundancy. In Pre-LayerNorm residual stacks, the gradient at a layer is the sum of an identity path and all downstream functional paths, producing a gradient fan-in that decays linearly with depth (and quadratically under deep supervision), yielding rich gradient signals for early layers and sparse signals for later ones. Across Transformers and ResNets, accumulated training gradients follow the theoretical fan-in and predict post hoc layer importance. Two causal interventions isolate structure as the bottleneck: equalizing per-layer gradient norms does not restore late-layer value, whereas increasing downstream path counts via parameter-shared repetition restores and elevates their impact. Building on this mechanism, we propose CascadeFlow Pruning, which removes layers using accumulated training gradients and outperforms standard heuristics without expensive post hoc analysis. We also introduce CascadeFormer, which tapers width with depth to match the natural information flow, achieving comparable perplexity to a uniform baseline at the same training budget while reducing latency by 8.6% and increasing throughput by 9.4%.
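As a reading aid, here is a minimal sketch of the path-counting argument behind the claimed fan-in decay, assuming the standard Pre-LN residual recursion $x_{l+1} = x_l + F_l(\mathrm{LN}(x_l))$ with a final loss $\mathcal{L}$ after layer $L$; the notation and the exact placement of auxiliary losses are our assumptions, not the paper's.

$$
\frac{\partial \mathcal{L}}{\partial x_l}
= \frac{\partial \mathcal{L}}{\partial x_L}
  \prod_{k=l}^{L-1}\left(I + \frac{\partial F_k(\mathrm{LN}(x_k))}{\partial x_k}\right),
$$

so expanding the product to first order gives the identity path plus one functional path per downstream block, i.e. $L-l$ functional paths feeding layer $l$, which decays linearly as $l$ approaches $L$. With deep supervision (an auxiliary loss after every layer $m > l$), these counts sum to

$$
\sum_{m=l+1}^{L} (m-l) = \frac{(L-l)(L-l+1)}{2} = O\!\big((L-l)^2\big),
$$

matching the quadratic decay stated in the abstract.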
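The following is a minimal PyTorch-style sketch of pruning by accumulated training gradients in the spirit of CascadeFlow Pruning, not the authors' implementation: it assumes the importance score is simply a running sum of per-block gradient norms, and that the model exposes its residual blocks as `model.blocks`; all function and attribute names here are hypothetical.

```python
import torch
import torch.nn as nn

def accumulate_layer_grad_norms(model: nn.Module, scores: dict) -> None:
    """Add each block's current gradient norm to its running importance score.

    Call after loss.backward() on every training step. Assumes `model.blocks`
    is an nn.ModuleList of residual blocks (hypothetical interface).
    """
    for idx, block in enumerate(model.blocks):
        sq_sum = 0.0
        for p in block.parameters():
            if p.grad is not None:
                sq_sum += p.grad.detach().norm().item() ** 2
        scores[idx] = scores.get(idx, 0.0) + sq_sum ** 0.5

def prune_lowest_score_layers(model: nn.Module, scores: dict, n_prune: int) -> nn.Module:
    """Drop the n_prune blocks with the smallest accumulated gradient norms."""
    keep = sorted(scores, key=scores.get)[n_prune:]   # indices of blocks to keep
    model.blocks = nn.ModuleList(model.blocks[i] for i in sorted(keep))
    return model
```

Under this sketch, the scores come for free during ordinary training, which is the abstract's point of contrast with heuristics that require a separate post hoc importance analysis.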
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 24034