Keywords: Model Compression, Dynamic Inference
Abstract: Large language models require substantial computational resources for inference due to their massive number of parameters. Model layer pruning accelerates inference by eliminating redundant layers. However, existing layer pruning methods fail to meet users' flexible budget constraints and lack the ability to adaptively adjust the inference path. To address these issues, we propose Buddy, a budget-driven and adaptive inference framework. Specifically, we design a Decision Module that adaptively selects important layers to execute based on user input while satisfying a given budget constraint. Additionally, Buddy reuses the KV cache from the first layer and dynamically updates the context during inference, enabling adaptive adjustments to the inference path based on evolving contextual information. Furthermore, when no explicit budget is provided, a Budget Predictor automatically determines an appropriate inference cost to achieve an optimal trade-off between performance and computational efficiency. Extensive experiments on the Llama model demonstrate that Buddy consistently outperforms baseline methods under various pruning configurations.
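The budget-constrained layer selection the abstract attributes to the Decision Module can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the function name, the per-layer importance scores, and the greedy top-k selection are not taken from the paper, which only states that important layers are selected under a budget while the first layer's KV cache is reused.

```python
# Hypothetical sketch of budget-driven layer selection, loosely inspired by
# the Decision Module described in the abstract. The scoring scheme and
# selection rule are illustrative assumptions, not the paper's method.

def select_layers(importance_scores, budget):
    """Keep at most `budget` layers, chosen by importance.

    Layer 0 is always retained, since the abstract states its KV cache
    is reused during inference.
    """
    num_layers = len(importance_scores)
    # Rank the remaining layers by descending importance.
    ranked = sorted(range(1, num_layers),
                    key=lambda i: importance_scores[i], reverse=True)
    kept = [0] + ranked[: max(budget - 1, 0)]
    return sorted(kept)  # execute kept layers in their original order

# Example: 6 layers, budget of 3 -> layer 0 plus the two highest-scoring.
scores = [0.9, 0.1, 0.8, 0.4, 0.7, 0.2]
print(select_layers(scores, 3))  # -> [0, 2, 4]
```

In an adaptive setting, the scores would be recomputed per input (and, per the abstract, updated from evolving context mid-inference) rather than fixed as in this static example.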
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 5872