Backoff Decoding: A Language Model Inference Acceleration Framework with a Tunable Efficiency-Performance Tradeoff
Keywords: language modeling, inference acceleration, decoding strategies
TL;DR: We introduce a language model inference acceleration framework that allocates token generations between models of different sizes.
Abstract: In current transformer-based language models, all tokens in a sequence are generated by identical forward passes and thereby incur the same inference cost. However, tokens vary widely in their importance to the overall generation and their difficulty for models to generate correctly, making this equal allocation of inference resources suboptimal. We introduce backoff decoding, a framework for efficient language model inference that dynamically allocates token generations between two (or more) models of different sizes, according to an arbitrary decision function. By modifying how this decision function allocates generations between the differently sized models, users can tune their generation along an efficiency-performance tradeoff to suit the needs of their application. Backoff decoding can be used on any set of models with the same tokenizer and does not require any training or finetuning of the models themselves. As a demonstration of our framework, we show that backoff decoding with a large and a small model can significantly reduce inference cost while sacrificing virtually no performance compared to the standalone large model. We then show that inference costs can be reduced even further, achieving inference accelerations of up to 3-4x in exchange for reductions in model performance, demonstrating an efficiency-performance tunability not found in other inference acceleration techniques.
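To make the allocation mechanism concrete, here is a minimal sketch of a two-model backoff decoder under stated assumptions: it uses Hugging Face `transformers` with `distilgpt2` as the small model and `gpt2-large` as the large model (both share the GPT-2 tokenizer), greedy decoding, and a simple max-probability confidence threshold as the decision function. The threshold rule, the model pair, and the names `backoff_decode` and `decision_fn` are illustrative assumptions, not the paper's actual method or API.

```python
# Hypothetical sketch of backoff decoding with two models that share a tokenizer.
# The confidence-threshold decision function is an assumed example; the framework
# allows an arbitrary rule for routing each token to the small or large model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
small = AutoModelForCausalLM.from_pretrained("distilgpt2").eval()
large = AutoModelForCausalLM.from_pretrained("gpt2-large").eval()

def decision_fn(small_probs: torch.Tensor, threshold: float = 0.5) -> bool:
    """Back off to the large model when the small model's top probability is low.
    (Assumed decision function; any rule over the small model's output could be used.)"""
    return small_probs.max().item() < threshold

@torch.no_grad()
def backoff_decode(prompt: str, max_new_tokens: int = 50, threshold: float = 0.5) -> str:
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        # Score the next token with the small (cheap) model first.
        # For clarity this sketch recomputes full forward passes each step;
        # a real implementation would use KV caching.
        small_logits = small(input_ids).logits[:, -1, :]
        small_probs = torch.softmax(small_logits, dim=-1)
        if decision_fn(small_probs, threshold):
            # Low confidence: pay for a large-model forward pass on this token.
            next_id = large(input_ids).logits[:, -1, :].argmax(dim=-1, keepdim=True)
        else:
            # High confidence: keep the small model's (greedy) prediction.
            next_id = small_probs.argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_id], dim=-1)
        if next_id.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(input_ids[0], skip_special_tokens=True)

print(backoff_decode("The quick brown fox"))
```

In this sketch, raising the threshold routes more tokens to the large model (higher quality, higher cost), while lowering it routes more tokens to the small model, which is one way the efficiency-performance tradeoff described in the abstract could be tuned.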
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 12627