Block-Diagonal K-FAC: A Trade-off Between Curvature Information and Resource Efficiency

Published: 22 Sept 2025, Last Modified: 01 Dec 2025 · NeurIPS 2025 Workshop · CC BY 4.0
Keywords: K-FAC, second-order optimization, deep learning, block-diagonal approximation
Abstract: We introduce Block-Diagonal K-FAC (BD-KFAC), a second-order optimizer that preserves block-diagonal structure in the Kronecker factors of K-FAC to balance curvature fidelity against resource efficiency. Concretely, we partition activations and pre-activation gradients into per-layer blocks and perform an eigendecomposition per block, precomputing damped block-wise inverses to amortize per-step cost and reduce communication in distributed training. On CIFAR-100, across both CNN and ViT architectures, BD-KFAC converges faster in wall-clock time than baseline optimizers and uses less memory and compute than full K-FAC under a unified training setup. Overall, BD-KFAC offers a practical middle ground between full-matrix and purely diagonal second-order methods.
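The damped block-wise inversion described in the abstract can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the function name `block_damped_inverse`, the `block_sizes` partition argument, and the `damping` value are all assumptions introduced here. Given a Kronecker factor, cross-block entries are dropped and each diagonal block is inverted via a symmetric eigendecomposition with Tikhonov damping.

```python
import numpy as np

def block_damped_inverse(factor, block_sizes, damping=1e-3):
    """Approximate the damped inverse of a Kronecker factor by keeping
    only its diagonal blocks (sizes given by `block_sizes`) and inverting
    each block via eigendecomposition. Illustrative sketch only; the
    interface is a hypothetical stand-in for the paper's method.
    """
    inv = np.zeros_like(factor)
    start = 0
    for size in block_sizes:
        end = start + size
        block = factor[start:end, start:end]
        # Symmetric eigendecomposition of the diagonal block.
        eigvals, eigvecs = np.linalg.eigh(block)
        # Damped inverse of the block: V diag(1/(lambda_i + damping)) V^T.
        # Column-wise scaling of V by the inverted, damped eigenvalues.
        inv[start:end, start:end] = (eigvecs / (eigvals + damping)) @ eigvecs.T
        start = end
    return inv
```

Precomputing these block inverses once per curvature update, rather than at every step, is what amortizes the eigendecomposition cost; because each block is independent, the inverses can also be computed and communicated per block in a distributed setting.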
Submission Number: 35