HAP-E: HESSIAN-AWARE STRUCTURED PRUNING OF LLMS FOR EFFICIENT INFERENCE

ICLR 2026 Conference Submission22293 Authors

20 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Large language models (LLMs), Structured pruning, Hessian-aware pruning, Optimal Brain Surgeon (OBS), Greedy-consistent batch pruning, Latency-aware compression, Hardware-constrained inference
TL;DR: We propose a scalable, Hessian-aware pruning framework for LLMs that accounts for cross-layer interactions by adaptively selecting candidates, certifying OBS-equivalent batches for pruning, and integrating latency prediction for constrained inference
Abstract: Large language models (LLMs) deliver strong performance across diverse tasks, but their heavy compute and memory demands make deployment on real-time edge devices difficult. Structured pruning has become the standard approach to reducing these costs, yet accurately estimating which blocks can be removed remains hard at scale: second-order methods such as Optimal Brain Surgeon (OBS) are computationally intractable at LLM scale, existing approaches rely on static budgets that ignore cross-layer dependencies, and common proxies like FLOPs misestimate real hardware latency. We introduce HAP-E, a scalable, Hessian-aware pruning framework for post-training compression of LLMs. HAP-E adaptively reallocates budgets across layers using global screening and selective second-order analysis on a candidate set guided by cross-layer sensitivity estimation. It further performs OBS-equivalent batch pruning that certifies and removes multiple blocks at once while exactly matching the greedy OBS sequence, thereby reducing weight updates and numerical drift. A lightweight latency predictor ensures that the compressed model satisfies inference-time constraints. Experiments on LLaMA and OPT models show that HAP-E improves accuracy by up to 3% over state-of-the-art structured pruning methods at comparable pruning ratios.
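For readers unfamiliar with the OBS criterion the abstract builds on, the following is a minimal NumPy sketch of the classical structured OBS saliency and compensating weight update; the function name and interface are illustrative background only, not HAP-E's actual implementation or candidate-screening procedure.

```python
import numpy as np

def obs_block_saliency(w, H_inv, idx):
    """Classical structured OBS scores for removing the weights at `idx`.

    w     : (d,) flattened weight vector of one layer
    H_inv : (d, d) inverse Hessian of the layer loss w.r.t. w
    idx   : indices of the candidate block (e.g., one attention head or channel)
    Returns (saliency, delta_w), where delta_w optimally compensates the
    remaining weights after the block is zeroed.
    """
    w_q = w[idx]                          # weights in the candidate block
    H_inv_qq = H_inv[np.ix_(idx, idx)]    # inverse-Hessian sub-block for that set
    # Loss increase if the block is removed and the rest is optimally adjusted:
    #   saliency = 1/2 * w_q^T (H^-1_qq)^-1 w_q
    sol = np.linalg.solve(H_inv_qq, w_q)
    saliency = 0.5 * w_q @ sol
    # Optimal compensating update: delta_w = -H^-1[:, idx] (H^-1_qq)^-1 w_q
    delta_w = -H_inv[:, idx] @ sol
    return saliency, delta_w
```

Greedy OBS repeats this scoring and update one block at a time; the batch pruning described in the abstract removes several blocks per step while certifying that the result matches that greedy sequence.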
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 22293