Keywords: Compression, Pruning, SSMs, Hybrid LLM
TL;DR: The paper introduces a pruning and distillation method for hybrid LLMs, compressing Nemotron-H 8B to 4B parameters with higher accuracy than similarly sized models and ~2× faster inference, advancing the accuracy-efficiency trade-off.
Abstract: Hybrid language models that combine Attention and State Space Models (SSMs) have been shown to achieve state-of-the-art accuracy and runtime performance. Recent work has also demonstrated that applying pruning and distillation to Attention-only models yields smaller, more accurate models at a fraction of the training cost. In this work, we explore the effectiveness of compressing hybrid architectures. To this end, we introduce a novel group-aware pruning method for Mamba layers that preserves the structural integrity of SSM blocks and their sequence modeling capabilities. We combine this method with FFN, embedding dimension, and layer pruning, along with knowledge distillation-based retraining, to obtain a unified compression recipe for hybrid models. Using this recipe, we compress the Nemotron-H 8B hybrid model down to 4B parameters using up to $40\times$ fewer training tokens than similarly sized models. The resulting model surpasses the accuracy of similarly sized models while achieving $\sim2\times$ faster inference throughput, significantly advancing the Pareto frontier.
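To make the group-aware pruning idea concrete, the following is a minimal, illustrative sketch (not the paper's implementation): it scores SSM heads by a weight-magnitude criterion and keeps an equal number of top-scoring heads in every group, so the grouped structure of a Mamba-2-style block remains intact after pruning. All dimensions, the importance criterion, and variable names are assumptions for illustration only.

```python
import torch

# Illustrative, simplified Mamba-2-style dimensions (assumed, not the paper's values).
N_GROUPS = 8          # SSM groups; B/C projections are shared within a group
HEADS_PER_GROUP = 8   # heads per group
HEAD_DIM = 64
N_HEADS = N_GROUPS * HEADS_PER_GROUP
D_INNER = N_HEADS * HEAD_DIM   # expanded SSM inner dimension
D_MODEL = 2048

# Stand-in weight: out_proj maps the SSM inner dimension back to the model dimension.
out_proj = torch.randn(D_MODEL, D_INNER)

# 1) Score each head, here by the L2 norm of its slice of the output projection
#    (an activation-based criterion could be substituted).
head_scores = (
    out_proj.reshape(D_MODEL, N_HEADS, HEAD_DIM)
    .pow(2).sum(dim=(0, 2)).sqrt()        # one importance score per head
    .reshape(N_GROUPS, HEADS_PER_GROUP)   # regroup scores by SSM group
)

# 2) Group-aware selection: keep the same number of top-scoring heads in every
#    group, so each group survives with identical width and the grouped
#    projections remain structurally valid.
KEEP_PER_GROUP = 4
kept = head_scores.topk(KEEP_PER_GROUP, dim=1).indices.sort(dim=1).values

# 3) Translate per-group head indices into channel indices of the inner
#    dimension and slice the projection accordingly.
head_offsets = torch.arange(N_GROUPS).unsqueeze(1) * HEADS_PER_GROUP
kept_heads = (kept + head_offsets).flatten()                       # global head ids
kept_channels = (
    kept_heads.unsqueeze(1) * HEAD_DIM + torch.arange(HEAD_DIM)
).flatten()

pruned_out_proj = out_proj[:, kept_channels]
print(out_proj.shape, "->", pruned_out_proj.shape)  # (2048, 4096) -> (2048, 2048)
```

In a real hybrid model the same kept-head mask would also be applied to the corresponding input projections and SSM state parameters; this sketch only shows why enforcing a uniform per-group budget keeps every group well-formed after pruning.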
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 13329