On multi-token prediction for efficient LLM inference

Published: 05 Mar 2025, Last Modified: 23 Apr 2025 · SLLM · CC BY 4.0
Track: long paper (up to 4 pages)
Keywords: transformers, LLMs, efficient inference, multi-token prediction
TL;DR: A systematic investigation of multi-token prediction applied to LLMs trained for next-token prediction
Abstract: We systematically investigate multi-token prediction (MTP) capabilities within LLMs pretrained for next-token prediction (NTP). We first show that such models inherently possess MTP capabilities via numerical marginalization over intermediate token probabilities, though performance is data-dependent and improves with model scale. Furthermore, we explore the challenges of integrating MTP heads into frozen LLMs and find that their hidden layers are strongly specialized for NTP, making adaptation non-trivial. Finally, we show that while joint training of MTP heads with the backbone improves performance, it cannot fully overcome this barrier, motivating further research in this direction. Our findings provide a deeper understanding of MTP applied to pretrained LLMs, informing strategies for accelerating inference through parallel token prediction.
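
A minimal sketch of the marginalization the abstract refers to (not the authors' code; the exact formulation is in the paper): an NTP model can assign a probability to the token two positions ahead by summing p(x_{t+2} | x_{<=t}, v) · p(v | x_{<=t}) over intermediate tokens v. The model name, the top-k truncation of the sum, and the helper name below are illustrative assumptions.

```python
# Sketch: two-token-ahead prediction from a next-token-prediction model
# via numerical marginalization over the intermediate token.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: any Hugging Face causal LM works for this sketch
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

@torch.no_grad()
def two_token_ahead_probs(prompt: str, top_k: int = 50) -> torch.Tensor:
    """Approximate p(x_{t+2} | x_{<=t}) = sum_v p(x_{t+2} | x_{<=t}, v) p(v | x_{<=t}),
    truncating the sum to the top_k most likely intermediate tokens v."""
    ids = tok(prompt, return_tensors="pt").input_ids
    # p(x_{t+1} | x_{<=t}) from the last position's logits
    p_next = torch.softmax(model(ids).logits[0, -1], dim=-1)
    probs, cand = p_next.topk(top_k)
    marginal = torch.zeros_like(p_next)
    for p_v, v in zip(probs, cand):
        ext = torch.cat([ids, v.view(1, 1)], dim=1)
        # p(x_{t+2} | x_{<=t}, v), weighted by p(v | x_{<=t})
        marginal += p_v * torch.softmax(model(ext).logits[0, -1], dim=-1)
    return marginal  # sums to at most 1 because of the top-k truncation

print(tok.decode([two_token_ahead_probs("The capital of France is").argmax().item()]))
```

As the lead-in notes, truncating to the top-k intermediate tokens trades exactness for cost; the exact marginal requires one forward pass per vocabulary entry.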
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 50
