Think Big, Generate Quick: LLM-to-SLM for Fast Autoregressive Decoding

Published: 21 Jun 2024 · Last Modified: 26 Jul 2024 · ES-FoMo-II 2024 Poster · CC BY 4.0
Keywords: LLM, Hybrid Inference, SLM, Language Models, Efficiency
TL;DR: We propose a hybrid approach that combines language models of different sizes to increase the efficiency of autoregressive decoding while maintaining high performance.
Abstract: Large language models (LLMs) are widely used for text generation, but their size and reliance on autoregressive decoding increase deployment costs and latency. We propose a hybrid approach that combines language models of different sizes to improve efficiency while maintaining performance. Our method uses a pretrained LLM to encode the prompt tokens in a single parallel pass; the resulting representations then guide a small language model (SLM), which generates the response autoregressively at much lower cost. Combining encoder-decoder LLMs with encoder-decoder and decoder-only SLMs, we achieve up to a 4x speedup with a minor performance penalty of 1-2% on translation and summarization tasks relative to the LLM alone.
Submission Number: 27