Investigating the Overlooked Hessian Structure: From CNNs to LLMs

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
Abstract: It is well known that the Hessian of the deep loss landscape matters to the optimization and generalization of deep learning. Previous studies reported a rough Hessian structure in deep learning consisting of two components: a small number of large eigenvalues and a large number of nearly-zero eigenvalues. To the best of our knowledge, we are the first to report that a simple but overlooked power-law Hessian structure exists in well-trained deep neural networks, including Convolutional Neural Networks (CNNs) and Large Language Models (LLMs). Moreover, we provide a maximum-entropy theoretical interpretation of the power-law Hessian structure and theoretically demonstrate the existence of a robust, low-dimensional subspace of deep neural networks. Our extensive experiments using the proposed power-law spectral method demonstrate that power-law Hessian spectra critically relate to multiple important behaviors of deep learning, including optimization, generalization, and overparameterization. Notably, we discover that the power-law Hessian structure of a given LLM can effectively predict generalization during training, while conventional sharpness-based generalization measures that often work well on CNNs become nearly useless as generalization predictors for LLMs.
Lay Summary: The deep loss landscape matters to the optimization and generalization of deep learning; however, its Hessian structure is often overlooked in previous studies. We report that a simple power-law Hessian structure exists in well-trained neural networks, including Convolutional Neural Networks (CNNs) and Large Language Models (LLMs). We provide a maximum-entropy theoretical interpretation and theoretically demonstrate the existence of a robust, low-dimensional subspace of deep neural networks. We further provide extensive empirical results under different experimental setups demonstrating that these power-law Hessian spectra critically relate to multiple important behaviors of deep learning, including optimization, generalization, and overparameterization. Our findings indicate that this power-law structure in the Hessian spectrum offers a novel and promising perspective for understanding neural networks and their behavior.
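As a rough illustration of what measuring a power-law Hessian spectrum involves, the sketch below estimates the top Hessian eigenvalues of a small PyTorch model via Hessian-vector products with deflated power iteration, then fits an exponent $s$ for $\lambda_k \propto k^{-s}$ by least squares in log-log space. This is a minimal sketch under our own assumptions, not the paper's actual power-law spectral method; the function names (`hvp`, `top_eigenvalues`, `fit_power_law`) and the deflated power-iteration choice are illustrative.

```python
import torch
import numpy as np

def hvp(loss, params, vec):
    """Hessian-vector product H @ vec via double backprop (no explicit Hessian)."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])
    hv = torch.autograd.grad(flat_grad @ vec, params, retain_graph=True)
    return torch.cat([h.reshape(-1) for h in hv])

def top_eigenvalues(loss, params, k=20, iters=50):
    """Estimate the k largest Hessian eigenvalues by deflated power iteration.

    Assumes the leading eigenvalues are positive, as is typical at a minimum
    of a well-trained network; parameters are assumed to live on one device.
    """
    n = sum(p.numel() for p in params)
    device = params[0].device
    eigvals, eigvecs = [], []
    for _ in range(k):
        v = torch.randn(n, device=device)
        v /= v.norm()
        for _ in range(iters):
            w = hvp(loss, params, v)
            # Deflate previously found directions: w <- (H - sum_i lam_i u_i u_i^T) v.
            for lam, u in zip(eigvals, eigvecs):
                w = w - lam * (u @ v) * u
            v = w / w.norm()
        eigvals.append((v @ hvp(loss, params, v)).item())
        eigvecs.append(v.detach())
    return np.array(sorted(eigvals, reverse=True))

def fit_power_law(eigvals):
    """Fit lambda_k ~ lambda_1 * k^{-s}; returns the estimated exponent s."""
    ranks = np.arange(1, len(eigvals) + 1)
    slope, _ = np.polyfit(np.log(ranks), np.log(eigvals), 1)
    return -slope
```

Under this sketch, a spectrum whose log-eigenvalues fall on a straight line against log-rank (i.e., `fit_power_law` returns a stable exponent with small residuals) would be consistent with the power-law structure the paper reports.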
Primary Area: Deep Learning->Everything Else
Keywords: Hessian, Loss Landscape, Generalization, Large Language Models
Submission Number: 11101