Late Breaking Results: Dynamically Scalable Pruning for Transformer-Based Large Language Models

Junyoung Lee, Shinhyoung Jang, Seohyun Kim, Jongho Park, Ilhong Suh, Hoon Sung Chwa, Yeseong Kim

Published: 2025, Last Modified: 25 May 2026, DATE 2025, CC BY-SA 4.0
Abstract: We propose Matryoshka, a novel framework for pruning transformer models that enables dynamic runtime control while maintaining accuracy competitive with modern large language models (LLMs). Matryoshka incrementally constructs submodels of varying complexity, allowing runtime adaptation without maintaining separate models. Our evaluations on LLaMA-7B demonstrate that Matryoshka achieves up to 34% speedup and outperforms state-of-the-art pruning methods in quality, providing a flexible solution for deploying LLMs.