Keywords: collaborative inference, efficient inference, token-level routing, large language model
Abstract: Large language models (LLMs) have achieved remarkable success in natural language processing tasks but suffer from high computational costs during inference, limiting their deployment in latency-constrained applications. To address this issue, we propose \textbf{C}ollaborative \textbf{I}nference with \textbf{T}oken-l\textbf{E}vel \textbf{R}outing (CITER), a novel framework that introduces a token-level routing mechanism enabling efficient collaboration between small and large language models (SLMs \& LLMs). Specifically, CITER routes non-critical tokens to an SLM to reduce computational overhead, while critical tokens are processed by an LLM to maintain generation quality. We formulate router training as a reinforcement learning task, in which the router receives rewards based on both the quality of its predictions and the inference cost of generation. To further accelerate reward evaluation, we introduce a shortcut for estimating the reward function, significantly reducing its cost. Extensive experiments demonstrate that CITER reduces inference cost while preserving high-quality generation, offering a promising solution for real-time and resource-constrained applications.
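For a concrete picture of the token-level routing the abstract describes, below is a minimal sketch of a routed greedy-decoding loop. It assumes Hugging Face-style causal LM interfaces and batch size 1; the `router` module, the `threshold` parameter, and all other names are illustrative assumptions for this sketch, not the paper's released implementation.

```python
import torch

@torch.no_grad()
def generate_with_routing(small_lm, large_lm, router, input_ids,
                          max_new_tokens=64, threshold=0.5):
    """Greedy decoding where a per-token router picks which model predicts next.

    Hypothetical setup: `router` maps the SLM's last-token hidden state to a
    scalar in [0, 1]; scores above `threshold` mark the next token as critical
    and send it to the LLM, otherwise the SLM's own prediction is kept.
    """
    ids = input_ids
    for _ in range(max_new_tokens):
        # Cheap forward pass through the SLM, reused for both its logits
        # and a context representation for the routing decision.
        small_out = small_lm(ids, output_hidden_states=True)
        context = small_out.hidden_states[-1][:, -1]   # last-token hidden state
        critical = router(context).item() > threshold  # per-token route decision

        if critical:
            logits = large_lm(ids).logits[:, -1]       # LLM handles critical tokens
        else:
            logits = small_out.logits[:, -1]           # SLM handles the rest

        next_id = logits.argmax(dim=-1, keepdim=True)  # greedy next token
        ids = torch.cat([ids, next_id], dim=-1)
    return ids
```

In this sketch the router could be as simple as a linear layer with a sigmoid over the SLM's hidden state; savings come from skipping the large model's forward pass on tokens the router deems non-critical.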
Submission Number: 68