Achieving small-batch accuracy with large-batch scalability via Hessian-aware learning rate adjustment
Abstract: Highlights
• Hessian information allows the noise scale to be adjusted properly in large-batch training.
• Decaying the learning rate too early harms the underlying margin distribution.
• The minimum learning rate after the decay strongly affects the model sharpness.
• The length of the noise scale transition affects the final generalization performance.
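The terms used in the highlights (noise scale, learning rate decay, minimum learning rate) can be made concrete with a small sketch. It is not the paper's method: the Hessian-aware adjustment itself is not reproduced here. The sketch assumes the commonly used heuristic that the SGD noise scale is roughly `lr * N / B` for batch size `B` much smaller than dataset size `N`, and it uses a cosine decay as a stand-in schedule. All function and parameter names (`sgd_noise_scale`, `lr_for_target_noise`, `decay_with_floor`, `decay_start`, `min_lr`) are hypothetical.

```python
import math


def sgd_noise_scale(lr, batch_size, dataset_size):
    """Heuristic SGD noise scale, assumed here to be lr * N / B (B << N)."""
    return lr * dataset_size / batch_size


def lr_for_target_noise(target_noise, batch_size, dataset_size):
    """Learning rate that gives the target noise scale at this batch size."""
    return target_noise * batch_size / dataset_size


def decay_with_floor(step, total_steps, decay_start, peak_lr, min_lr):
    """Hold peak_lr until decay_start, then cosine-decay down to min_lr.

    decay_start controls how early the decay begins; min_lr is the floor
    the learning rate settles at once the decay has finished.
    """
    if step < decay_start:
        return peak_lr
    span = max(1, total_steps - decay_start)
    progress = min(1.0, (step - decay_start) / span)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    n = 50_000  # assumed dataset size, for illustration only
    small = sgd_noise_scale(lr=0.1, batch_size=256, dataset_size=n)
    # Learning rate needed at batch size 2048 to match the small-batch noise.
    large_lr = lr_for_target_noise(small, batch_size=2048, dataset_size=n)
    print(f"noise scale at B=256: {small:.3f}; matching lr at B=2048: {large_lr:.3f}")
    for step in (0, 500, 750, 1000):
        lr = decay_with_floor(step, 1000, decay_start=500, peak_lr=large_lr, min_lr=1e-3)
        print(step, round(lr, 5))
```

Under this heuristic, matching the small-batch noise scale at a larger batch size amounts to scaling the learning rate linearly with the batch size. The `decay_start` and `min_lr` arguments correspond to the two schedule choices the highlights single out: when the decay begins and where it ends.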