- Abstract: One of the difficulties of training deep neural networks is caused by improper scaling between layers. These scaling issues introduce exploding / gradient problems, and have typically been addressed by careful variance-preserving initialization. We consider this problem as one of preserving scale, rather than preserving variance. This leads to a simple method of scale-normalizing weight layers, which ensures that scale is approximately maintained between layers. Our method of scale-preservation ensures that forward propagation is impacted minimally, while backward passes maintain gradient scales. Preliminary experiments show that scale normalization effectively speeds up learning, without introducing additional hyperparameters or parameters.
- Conflicts: cs.umb.edu