Towards Fast Multilingual LLM Inference: Speculative Decoding and Specialized Drafters

ACL ARR 2024 June Submission1167 Authors

14 Jun 2024 (modified: 13 Aug 2024), ACL ARR 2024 June Submission, CC BY 4.0

Abstract: Large language models (LLMs) have revolutionized natural language processing and are now used across diverse commercial applications. However, their deployment is constrained by high inference time in multilingual settings. To mitigate this challenge, this paper explores a training recipe for the assistant model in speculative decoding, which drafts future tokens that are then verified by the target LLM. We show that language-specific draft models, optimized through a targeted pretrain-and-finetune strategy, substantially speed up inference compared to previous methods. We validate these models across various languages in terms of inference time, out-of-domain speedup, and GPT-4o evaluation.
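
The draft-then-verify loop mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes greedy decoding, models each LLM as a plain Python function from a token list to the next token, and uses hypothetical names (speculative_decode, toy_target, toy_drafter, draft_len). A real system would verify all drafted positions in a single batched forward pass of the target model, which is where the speedup comes from.

```python
from typing import Callable, List

Token = int
# Assumption: a "model" is a greedy next-token function over a token list.
NextTokenFn = Callable[[List[Token]], Token]


def greedy_decode(target: NextTokenFn, prompt: List[Token], n: int) -> List[Token]:
    """Baseline: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(n):
        tokens.append(target(tokens))
    return tokens


def speculative_decode(target: NextTokenFn, drafter: NextTokenFn,
                       prompt: List[Token], max_new_tokens: int,
                       draft_len: int = 4) -> List[Token]:
    """Greedy speculative decoding: the drafter proposes, the target verifies."""
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    while len(tokens) < goal:
        # 1) Draft: the small model proposes up to draft_len future tokens.
        k = min(draft_len, goal - len(tokens))
        draft: List[Token] = []
        for _ in range(k):
            draft.append(drafter(tokens + draft))
        # 2) Verify: keep drafted tokens while they match the target's choice;
        #    at the first mismatch, take the target's token and stop.
        accepted: List[Token] = []
        for tok in draft:
            expected = target(tokens + accepted)
            accepted.append(expected)
            if tok != expected:
                break
        tokens.extend(accepted)
    return tokens


# Toy stand-ins (hypothetical): the target counts modulo 10, and the drafter
# agrees with it except right after token 4, where it guesses wrong.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_drafter(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    print(speculative_decode(toy_target, toy_drafter, [0], 12))
```

Under greedy decoding the output is identical to decoding with the target alone, so the drafter affects only speed. The better the drafter matches the target on a given language, the more drafted tokens survive verification per step, which is the motivation for language-specific drafters.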

Paper Type: Short

Research Area: Special Theme (conference specific)

Research Area Keywords: Speculative decoding, LLM, multilingual translation

Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models

Languages Studied: English, German, French, Russian, Japanese, Chinese

Submission Number: 1167
