Keywords: weight initialization, depth-asymmetric scaling, activation-path count, training dynamics, loss-landscape flatness, convergence acceleration, generalization, robustness, Lottery-Ticket Hypothesis, deep neural networks
TL;DR: We propose Layer-Progressive Variance Scaling (LPVS), a simple, one-line modification to standard initializers (e.g., Kaiming/He) that balances stability with feature selectivity.
Abstract: Weight initialization is typically designed to preserve signal variance for training stability. We argue for a complementary goal: biasing the initial network toward a state that actively facilitates learning. While classical Xavier/Kaiming initializers ensure numerical stability, they can be slow to amplify task-relevant signals and to suppress input-level noise. We propose Layer-Progressive Variance Scaling (LPVS), a one-line wrapper around any analytical initializer that applies a depth-asymmetric schedule: it geometrically shrinks variance in early layers and amplifies it in later ones. We provide direct mechanistic evidence that this "suppress-then-amplify" strategy functions as an effective information filter, measurably reducing noise propagation while creating strong, active gradients across all layers. This leads to a higher effective path count and a provably U-shaped Jacobian spectrum, jointly contributing to a flatter loss landscape and accelerated optimization. On CIFAR-10, ImageNet, and IWSLT'14 Transformers, LPVS raises first-epoch accuracy by 3-10 pp, reaches key accuracy milestones up to four epochs sooner, and improves final peak performance. As a lightweight and computationally free method, LPVS offers a principled upgrade to the initialization toolkit, shifting the focus from stability to creating an information-rich substrate for learning.
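To make the "one-line wrapper" idea concrete, the sketch below shows one plausible reading of LPVS in PyTorch: apply a standard Kaiming/He initialization, then rescale each weight-bearing layer by a geometric factor that is below 1 for early layers and above 1 for late layers. The function name `lpvs_init_`, the schedule base `gamma`, and the depth-centred exponent are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

def lpvs_init_(model: nn.Module, gamma: float = 1.2) -> None:
    """Hypothetical sketch of Layer-Progressive Variance Scaling (LPVS).

    Wraps Kaiming/He initialization with an assumed depth-asymmetric
    geometric schedule: standard deviations shrink in early layers and
    grow in later ones. `gamma` (> 1) is an assumed schedule base; the
    paper's exact schedule may differ.
    """
    # Collect weight-bearing layers in module-registration (forward) order.
    layers = [m for m in model.modules() if isinstance(m, (nn.Linear, nn.Conv2d))]
    num_layers = len(layers)
    if num_layers == 0:
        return

    for idx, layer in enumerate(layers):
        # Base initialization: standard Kaiming/He (fan_in, ReLU gain).
        nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")
        if layer.bias is not None:
            nn.init.zeros_(layer.bias)

        # Depth-centred exponent in [-1, 1]: negative for early layers
        # (variance shrinks), positive for late layers (variance grows).
        t = 2.0 * idx / max(num_layers - 1, 1) - 1.0
        with torch.no_grad():
            layer.weight.mul_(gamma ** t)

# Usage (illustrative): lpvs_init_(my_model, gamma=1.2)
```

Scaling the weights by `gamma ** t` scales the per-layer variance by `gamma ** (2 * t)`, which matches the abstract's description of geometrically shrinking variance early and amplifying it late while leaving the underlying analytical initializer unchanged.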
Supplementary Material: zip
Primary Area: optimization
Submission Number: 21513