Think Big, Generate Quick: LLM-to-SLM for Fast Autoregressive Decoding

Anonymous

16 Dec 2023 (modified: 20 Dec 2023) · ACL ARR 2023 December Blind Submission · Readers: Everyone
TL;DR: We propose a hybrid approach that combines language models of different sizes to increase the efficiency of autoregressive decoding while maintaining high performance.
Abstract: Large language models (LLMs) have become ubiquitous in practice and are widely used for generation tasks such as translation, summarization and instruction following. However, their enormous size and reliance on autoregressive decoding increase deployment costs and complicate their use in latency-critical applications. In this work, we propose a hybrid approach that combines language models of different sizes to increase the efficiency of autoregressive decoding while maintaining high performance. Our method utilizes a pretrained frozen LLM that encodes all prompt tokens once in parallel, and uses the resulting representations to condition and guide a small language model (SLM), which then generates the response more efficiently. We investigate the combination of LLM encoders with both encoder-decoder and decoder-only SLMs from different model families and only require fine-tuning of the SLM. Experiments with various benchmarks show substantial speedups of up to $4\times$, often with only minor performance penalties of $1-2\%$ compared to the LLM.
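The abstract describes the architecture only at a high level: a frozen LLM encodes all prompt tokens in one parallel pass, and its representations condition a small language model that performs the autoregressive decoding. The sketch below illustrates that flow; the module choices, dimensions, and projection layer are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch of the LLM-to-SLM idea, assuming stand-in Transformer modules
# in place of the actual pretrained models; not the authors' code.
import torch
import torch.nn as nn


class LLMToSLM(nn.Module):
    def __init__(self, llm_dim: int = 1024, slm_dim: int = 256, vocab: int = 32000):
        super().__init__()
        # Stand-in for the pretrained LLM encoder (kept frozen; used for a single parallel pass).
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(llm_dim, nhead=8, batch_first=True), num_layers=2)
        for p in self.llm.parameters():
            p.requires_grad_(False)
        # Projection from the LLM hidden size down to the SLM hidden size (assumed component).
        self.proj = nn.Linear(llm_dim, slm_dim)
        # Stand-in for the small language model that is actually fine-tuned.
        self.slm = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(slm_dim, nhead=4, batch_first=True), num_layers=2)
        self.lm_head = nn.Linear(slm_dim, vocab)

    def forward(self, prompt_embeds: torch.Tensor, response_embeds: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():                     # the LLM stays frozen
            memory = self.llm(prompt_embeds)      # one parallel pass over all prompt tokens
        cond = self.proj(memory)                  # LLM representations as conditioning signal
        hidden = self.slm(response_embeds, cond)  # small model handles autoregressive decoding
        return self.lm_head(hidden)


# Toy usage: batch of 2 prompts (16 tokens) and partial responses (5 tokens).
model = LLMToSLM()
logits = model(torch.randn(2, 16, 1024), torch.randn(2, 5, 256))
print(logits.shape)  # torch.Size([2, 5, 32000])
```

Because the expensive model is queried only once per prompt while every generated token is produced by the small decoder, the per-token decoding cost scales with the SLM rather than the LLM, which is the source of the reported speedups.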
Paper Type: long
Research Area: Efficient/Low-Resource Methods for NLP
Contribution Types: Approaches to low compute settings / efficiency
Languages Studied: English, French, Romanian, German