Keywords: weight initialization, depth-asymmetric scaling, activation-path count, training dynamics, loss-landscape flatness, convergence acceleration, generalization, robustness, Lottery-Ticket Hypothesis, deep neural networks
TL;DR: We propose Layer-Progressive Variance Scaling (LPVS), a simple, one-line modification to standard initializers (e.g., Kaiming/He) that balances stability with feature selectivity.
Abstract: Weight initialization is typically designed to preserve signal variance for training stability. We argue for a complementary goal: biasing the initial network toward a state that actively facilitates learning. While classical Xavier/Kaiming initializers ensure numerical stability, they can be slow to amplify task-relevant signals and to suppress input-level noise. We propose Layer-Progressive Variance Scaling (LPVS), a one-line wrapper around any analytical initializer that applies a depth-asymmetric schedule: it geometrically shrinks variance in early layers and amplifies it in later ones. We provide direct mechanistic evidence that this "suppress-then-amplify" strategy functions as an effective information filter, measurably reducing noise propagation while creating strong, active gradients across all layers. This leads to a higher effective path count and a provably U-shaped Jacobian spectrum, jointly contributing to a flatter loss landscape and accelerated optimization. On CIFAR-10, ImageNet, and IWSLT'14 Transformers, LPVS raises first-epoch accuracy by 3-10 pp, reaches key accuracy milestones up to four epochs sooner, and improves final peak performance. As a lightweight and computationally free method, LPVS offers a principled upgrade to the initialization toolkit, shifting the focus from stability to creating an information-rich substrate for learning.
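To make the "one-line wrapper" idea concrete, the sketch below shows one plausible reading of LPVS in PyTorch: apply a standard Kaiming/He initialization, then rescale each weight-bearing layer by a geometric factor that is below 1 for early layers and above 1 for late layers. The function name `lpvs_init_`, the schedule base `gamma`, and the depth-centred exponent are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

def lpvs_init_(model: nn.Module, gamma: float = 1.2) -> None:
    """Hypothetical sketch of Layer-Progressive Variance Scaling (LPVS).

    Wraps Kaiming/He initialization with an assumed depth-asymmetric
    geometric schedule: standard deviations shrink in early layers and
    grow in later ones. `gamma` (> 1) is an assumed schedule base; the
    paper's exact schedule may differ.
    """
    # Collect weight-bearing layers in module-registration (forward) order.
    layers = [m for m in model.modules() if isinstance(m, (nn.Linear, nn.Conv2d))]
    num_layers = len(layers)
    if num_layers == 0:
        return

    for idx, layer in enumerate(layers):
        # Base initialization: standard Kaiming/He (fan_in, ReLU gain).
        nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")
        if layer.bias is not None:
            nn.init.zeros_(layer.bias)

        # Depth-centred exponent in [-1, 1]: negative for early layers
        # (variance shrinks), positive for late layers (variance grows).
        t = 2.0 * idx / max(num_layers - 1, 1) - 1.0
        with torch.no_grad():
            layer.weight.mul_(gamma ** t)

# Usage (illustrative): lpvs_init_(my_model, gamma=1.2)
```

Scaling the weights by `gamma ** t` scales the per-layer variance by `gamma ** (2 * t)`, which matches the abstract's description of geometrically shrinking variance early and amplifying it late while leaving the underlying analytical initializer unchanged.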
Supplementary Material: zip
Primary Area: optimization
Submission Number: 21513