Prompt-based Depth Pruning of Large Language Models

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We develop a prompt-based depth pruning algorithm.
Abstract: Depth pruning aims to reduce the inference cost of a large language model without any hardware-specific complications, by simply removing several less important transformer blocks. However, our empirical findings suggest that the importance of a transformer block may be highly task-dependent---a block that is crucial for one task can be removed without degrading accuracy on another. Based on this observation, we develop a dynamic depth pruning algorithm, coined PuDDing (**P**rompt-ro**u**ted **D**ynamic **D**epth Prun**ing**), which determines which blocks to omit from the model based on the input prompt. PuDDing operates by training a lightweight router to predict the best omission set among a set of options, where this option set has also been constructed in a data-driven manner. Empirical results on commonsense reasoning benchmarks demonstrate that PuDDing effectively accelerates the inference of language models and achieves better on-task performance than static depth pruning baselines.
Lay Summary: Large language models (LLMs) deliver impressive reasoning and question-answering abilities, but executing every layer of these massive networks is both slow and expensive. A widely used shortcut—“depth pruning”—simply removes certain transformer blocks to speed up inference, but this static approach often degrades performance: a layer that appears redundant for one input may be critical for another. We propose PuDDing (Prompt-routed Dynamic Depth Pruning), which dynamically skips layers based on the content of each prompt. First, we profile real examples to create several candidate pruned models. Then, at inference time, a lightweight router analyzes the incoming prompt and selects the variant that retains only the most relevant layers. This prompt-aware strategy accelerates LLM inference without requiring specialized hardware, and even improves accuracy on challenging reasoning benchmarks. By avoiding unnecessary computation, PuDDing lowers energy consumption and brings advanced language understanding closer to real-time applications.
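The routing idea described above can be sketched in a few lines. This is a minimal toy illustration, not the authors' implementation: the candidate omission sets, the linear router, and all names (`OMISSION_SETS`, `route`, `forward`) are assumptions made for exposition. A real system would train the router on profiled examples and skip actual transformer blocks.

```python
# Toy sketch of prompt-routed dynamic depth pruning (hypothetical names,
# random weights; for illustration only). A lightweight linear "router"
# scores a prompt feature vector against K candidate omission sets, and
# the forward pass then skips the chosen blocks entirely.
import numpy as np

rng = np.random.default_rng(0)

NUM_BLOCKS = 8   # transformer blocks in the toy model
FEAT_DIM = 16    # prompt feature dimension

# Candidate omission sets (in PuDDing these are constructed offline,
# in a data-driven manner; here they are arbitrary placeholders).
OMISSION_SETS = [{2, 5}, {3, 6}, {1, 7}]

# Lightweight router: one score per candidate omission set.
W = rng.normal(size=(len(OMISSION_SETS), FEAT_DIM))

def route(prompt_feat):
    """Pick the omission set with the highest router score."""
    scores = W @ prompt_feat
    return OMISSION_SETS[int(np.argmax(scores))]

# Toy "blocks": each is a small random linear map on the hidden state.
BLOCKS = [rng.normal(scale=0.1, size=(FEAT_DIM, FEAT_DIM))
          for _ in range(NUM_BLOCKS)]

def forward(prompt_feat):
    """Run the toy model, skipping the blocks the router omits."""
    omitted = route(prompt_feat)
    h = prompt_feat
    for i, blk in enumerate(BLOCKS):
        if i in omitted:
            continue          # depth pruning: skip this block entirely
        h = h + blk @ h       # residual update
    return h, omitted
```

Because the router runs once per prompt and is far smaller than a transformer block, its overhead is negligible next to the savings from the skipped blocks.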
Primary Area: Deep Learning->Algorithms
Keywords: Depth pruning, Model Compression
Submission Number: 4378