PDTrim: Targeted Pruning for Prefill-Decode Disaggregation in Inference

17 Sept 2025 (modified: 15 Dec 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: large language models, disaggregation
TL;DR: We propose a pruning method integrated with prefill-decode (PD) disaggregation.
Abstract: Large Language Models (LLMs) demonstrate exceptional capabilities across various tasks, but their deployment is constrained by high computational and memory costs. Model pruning provides an effective means of alleviating these demands. However, existing methods often ignore the characteristics of prefill-decode (PD) disaggregation used in practice. In this paper, we propose a novel pruning method for PD-disaggregated inference, enabling more precise and efficient block and KV cache pruning. Our approach constructs pruning and distillation sets to perform iterative block removal independently for the prefill and decode stages, obtaining better pruning solutions. Moreover, we introduce a cache pruning mechanism that selectively reuses entries corresponding to the first and last token sequences within designated layers, reducing communication costs while incurring only negligible computational overhead. Extensive experiments demonstrate that our approach consistently achieves strong performance in both PD-disaggregated and unified (non-disaggregated) settings. Under the same (default) settings, our method achieves improved performance and faster inference, along with a 4.95$\times$ reduction in data transmission bandwidth consumption.
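The cache pruning idea described in the abstract (transferring only the KV entries of the first and last token positions for designated layers when moving the cache from the prefill node to the decode node) could look roughly like the sketch below. The function name, tensor layout, and the exact selection rule are assumptions for illustration, not the paper's actual implementation.

```python
import torch

def select_kv_for_transfer(kv_cache, pruned_layers, num_first, num_last):
    """Sketch: shrink the KV cache before prefill-to-decode transfer.

    kv_cache: list of (k, v) tensors, one pair per layer,
              each shaped [batch, heads, seq_len, head_dim] (assumed layout).
    pruned_layers: set of layer indices whose cache is pruned before transfer.
    num_first / num_last: how many leading / trailing token positions to keep.
    """
    transfer = {}
    for layer_idx, (k, v) in enumerate(kv_cache):
        if layer_idx in pruned_layers:
            seq_len = k.shape[2]
            # Keep only the first and last token positions in this layer.
            keep = set(range(min(num_first, seq_len)))
            keep |= set(range(max(seq_len - num_last, 0), seq_len))
            idx = torch.tensor(sorted(keep), device=k.device)
            transfer[layer_idx] = (k.index_select(2, idx), v.index_select(2, idx))
        else:
            # Non-designated layers are transferred in full.
            transfer[layer_idx] = (k, v)
    return transfer
```

Under this reading, the bandwidth saving comes from sending `num_first + num_last` positions instead of the full sequence length for the designated layers, while the decode node recomputes or approximates whatever it needs for the dropped middle positions; how that gap is handled is not specified here and would follow the paper's own mechanism.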
Supplementary Material: zip
Primary Area: generative models
Submission Number: 8296