SkipGPT: Each Token is One of a Kind

Published: 01 May 2025, Last Modified: 18 Jun 2025 | ICML 2025 poster | License: CC BY 4.0
TL;DR: We propose SkipGPT, a dynamic pruning framework that adapts to token complexity, decouples MLP and attention pruning, and uses a two-stage training paradigm, cutting over 40% of parameters while preserving or improving performance.
Abstract: Large language models (LLMs) achieve remarkable performance across tasks but incur substantial computational costs due to their deep, multi-layered architectures. Layer pruning has emerged as a strategy to alleviate these inefficiencies, but conventional static pruning methods overlook two critical dynamics inherent to LLM inference: (1) *horizontal dynamics*, where token-level heterogeneity demands context-aware pruning decisions, and (2) *vertical dynamics*, where the distinct functional roles of MLP and self-attention layers necessitate component-specific pruning policies. We introduce **SkipGPT**, a dynamic layer pruning framework designed to optimize computational resource allocation through two core innovations: (1) global token-aware routing to prioritize critical tokens and (2) decoupled pruning policies for MLP and self-attention components. To mitigate training instability, we propose a two-stage optimization paradigm: first, a disentangled training phase that learns routing strategies via soft parameterization to avoid premature pruning decisions, followed by parameter-efficient LoRA fine-tuning to restore performance impacted by layer removal. Extensive experiments demonstrate that SkipGPT removes over 40% of model parameters while matching or exceeding the performance of the original dense model across benchmarks. By harmonizing dynamic efficiency with preserved expressivity, SkipGPT advances the practical deployment of scalable, resource-aware LLMs. Our code is publicly available at: https://github.com/EIT-NLP/SkipGPT.
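To make the two core ideas concrete, here is a minimal PyTorch sketch of a transformer block with per-token routing (horizontal dynamics) and separate gates for the attention and MLP sub-layers (vertical dynamics). Everything here is an illustrative assumption: the class names (`TokenRouter`, `SkipBlock`), the sigmoid soft gate, and the 0.5 hard threshold are not taken from the paper; the authors' actual routing and two-stage training live in the linked repository.

```python
# Illustrative sketch of token-aware, component-decoupled layer skipping.
# NOT the SkipGPT implementation; names and thresholds are assumptions.
import torch
import torch.nn as nn


class TokenRouter(nn.Module):
    """Scores each token and emits a soft keep-probability in [0, 1]."""

    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Linear(d_model, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model) -> gate: (batch, seq, 1)
        return torch.sigmoid(self.proj(x))


class SkipBlock(nn.Module):
    """Transformer block with decoupled routers for attention and MLP."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        # Vertical dynamics: each component gets its own routing policy.
        self.attn_router = TokenRouter(d_model)
        self.mlp_router = TokenRouter(d_model)

    def forward(self, x: torch.Tensor, hard: bool = False) -> torch.Tensor:
        # Horizontal dynamics: one decision per token, not per layer.
        g_attn = self.attn_router(x)
        g_mlp = self.mlp_router(x)
        if hard:
            # Inference-style pruning: binarize the soft gates.
            g_attn = (g_attn > 0.5).float()
            g_mlp = (g_mlp > 0.5).float()
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        # Gating the residual branch means a skipped token passes through
        # unchanged, which is what "layer skipping" amounts to here.
        x = x + g_attn * attn_out
        x = x + g_mlp * self.mlp(self.ln2(x))
        return x


block = SkipBlock(d_model=64, n_heads=4)
x = torch.randn(2, 10, 64)
y_soft = block(x)              # stage-1 style: soft, differentiable routing
y_hard = block(x, hard=True)   # inference style: tokens skip components
```

Note that multiplying by a zero gate only emulates skipping; an actual deployment would gather the kept tokens before the attention/MLP call to realize the FLOP savings, and, per the abstract, a second LoRA fine-tuning stage would then recover any quality lost to hard pruning.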
Lay Summary: Large language models like ChatGPT are very powerful but also very expensive to run because they require a lot of computer processing. This paper introduces a new method called SkipGPT that helps these models run faster and more efficiently. Instead of treating every word and every part of the model the same, SkipGPT figures out which words are most important and which parts of the model are actually needed. By skipping unnecessary steps, it can cut the size of the model by about 40%—while still performing just as well, or even better, than before. This means we can build smarter AI systems that cost less and use less energy, making them easier to use in real-world applications like phones, robots, or smart assistants.
Link To Code: https://github.com/EIT-NLP/SkipGPT
Primary Area: Deep Learning->Large Language Models
Keywords: compression, pruning, layer skipping, efficiency, large language models, model optimization
Submission Number: 11885