Think Big, Generate Quick: LLM-to-SLM for Fast Autoregressive Decoding

Published: 21 Jun 2024 · Last Modified: 26 Jul 2024 · ES-FoMo-II 2024 Poster · CC BY 4.0
Keywords: LLM, Hybrid Inference, SLM, Language Models, Efficiency
TL;DR: We propose a hybrid approach that combines language models of different sizes to increase the efficiency of autoregressive decoding while maintaining high performance.
Abstract: Large language models (LLMs) are widely used for text generation, but their size and reliance on autoregressive decoding increase deployment costs and latency. We propose a hybrid approach that combines language models of different sizes to improve efficiency while maintaining performance. Our method uses a pretrained LLM to encode the prompt tokens in a single parallel pass; the resulting representations then guide a small language model (SLM), which generates the response autoregressively at much lower cost. Combining encoder-decoder LLMs with encoder-decoder and decoder-only SLMs, we achieve up to a 4x speedup with a minor performance penalty of 1-2% on translation and summarization tasks relative to the LLM alone.
Submission Number: 27