Keywords: LLM;Hybrid Inference;SLM;Language Models;Efficiency
TL;DR: We propose a hybrid approach that combines language models of different sizes to increase the efficiency of autoregressive decoding while maintaining high performance.
Abstract: Large language models (LLMs) are widely used for text generation, but their size and reliance on autoregressive decoding increase deployment cost and latency. We propose a hybrid approach that combines language models of different sizes to improve efficiency while maintaining performance. Our method uses a pretrained LLM to encode the prompt tokens in parallel, and these encodings then guide a small language model (SLM) that generates the response more efficiently. By pairing encoder-decoder LLMs with encoder-decoder and decoder-only SLMs, we achieve up to a 4x speedup over the LLM alone, with minor performance penalties of 1-2% on translation and summarization tasks.
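To make the mechanism in the abstract concrete, here is a minimal PyTorch sketch of the general idea: a large, frozen encoder processes the prompt once in parallel, and a small decoder cross-attends to those states while generating autoregressively. All class names, dimensions, and the bridging projection are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class HybridGenerator(nn.Module):
    """LLM encoder runs once over the prompt; a small decoder generates the response."""

    def __init__(self, llm_encoder, vocab_size, llm_dim, slm_dim=256,
                 num_layers=4, num_heads=4):
        super().__init__()
        self.llm_encoder = llm_encoder                     # large, pretrained, kept frozen
        for p in self.llm_encoder.parameters():
            p.requires_grad = False
        self.bridge = nn.Linear(llm_dim, slm_dim)          # project LLM states to the SLM width
        self.embed = nn.Embedding(vocab_size, slm_dim)
        layer = nn.TransformerDecoderLayer(slm_dim, num_heads, batch_first=True)
        self.slm_decoder = nn.TransformerDecoder(layer, num_layers)
        self.lm_head = nn.Linear(slm_dim, vocab_size)

    def forward(self, prompt_ids, response_ids):
        # 1) Encode all prompt tokens in parallel with the large encoder (single pass).
        with torch.no_grad():
            memory = self.llm_encoder(prompt_ids)          # (B, T_prompt, llm_dim)
        memory = self.bridge(memory)
        # 2) The small decoder cross-attends to the LLM states while decoding autoregressively.
        tgt = self.embed(response_ids)                     # (B, T_resp, slm_dim)
        causal_mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        hidden = self.slm_decoder(tgt, memory, tgt_mask=causal_mask)
        return self.lm_head(hidden)                        # next-token logits from the SLM


# Toy usage with a stand-in "LLM" encoder; a real setup would load a pretrained model.
class ToyEncoder(nn.Module):
    def __init__(self, vocab_size=32000, dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        enc_layer = nn.TransformerEncoderLayer(dim, 8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)

    def forward(self, ids):
        return self.encoder(self.embed(ids))


model = HybridGenerator(ToyEncoder(), vocab_size=32000, llm_dim=1024)
prompt = torch.randint(0, 32000, (2, 16))
response = torch.randint(0, 32000, (2, 8))
logits = model(prompt, response)                            # (2, 8, 32000)
```

The speedup in this framing comes from running the expensive model only once over the prompt (fully parallel), so every autoregressive decoding step is paid at the SLM's cost rather than the LLM's.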
Submission Number: 27