Keywords: K-FAC, second-order optimization, deep learning, block-diagonal approximation
Abstract: We introduce Block-Diagonal K-FAC (BD-KFAC), a second-order optimizer that preserves block-diagonal structure in the Kronecker factors of K-FAC to balance curvature fidelity and resource efficiency. Concretely, we partition activations and pre-activation gradients into per-layer blocks and perform eigendecomposition per block, while precomputing damped block-wise inverses to amortize per-step costs and reduce communication in distributed training. On CIFAR-100 across both CNN and ViT architectures, BD-KFAC achieves faster wall-clock convergence than baselines and uses less memory and compute than full K-FAC under a unified training setup. Overall, BD-KFAC offers a practical middle ground between full-matrix and purely diagonal second-order methods.
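To make the block-wise step concrete, below is a minimal NumPy sketch of the kind of damped block-diagonal inverse the abstract describes: each diagonal block of a symmetric Kronecker factor is eigendecomposed, damped, and inverted in its eigenbasis. The function name `block_diag_damped_inverse` and the arguments `block_sizes` and `damping` are illustrative choices, not identifiers from the paper, and this is a sketch of the general technique rather than the authors' implementation.

```python
import numpy as np

def block_diag_damped_inverse(factor, block_sizes, damping=1e-3):
    """Damped inverse of a symmetric PSD Kronecker factor, restricted
    to its diagonal blocks (illustrative BD-KFAC-style sketch).

    Each block B is inverted in its eigenbasis:
        inv(B + damping * I) = V diag(1 / (eigvals + damping)) V^T
    """
    n = factor.shape[0]
    assert sum(block_sizes) == n, "block sizes must tile the factor"
    inv = np.zeros_like(factor)
    start = 0
    for size in block_sizes:
        stop = start + size
        block = factor[start:stop, start:stop]
        # Symmetric eigendecomposition of this diagonal block only.
        eigvals, eigvecs = np.linalg.eigh(block)
        # V diag(1/(lambda + damping)) V^T, via column-wise broadcasting.
        inv_block = (eigvecs / (eigvals + damping)) @ eigvecs.T
        inv[start:stop, start:stop] = inv_block
        start = stop
    return inv
```

In a K-FAC-style update, the precomputed inverses of the activation factor A and the gradient factor G would precondition a layer's gradient as `G_inv @ grad_W @ A_inv`; restricting the eigendecompositions to diagonal blocks is what trades some curvature fidelity for lower per-step cost, as the abstract suggests.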
Submission Number: 35