Accelerating Multiple Heads Decoding for LLMs via Dynamic Tree Attention

Published: 05 Mar 2025 · Last Modified: 14 Apr 2025 · SCOPE - ICLR 2025 Poster · CC BY 4.0
Track: Tiny paper track (up to 2 pages excluding references and appendix)
Keywords: MEDUSA, multiple heads decoding, tree attention, dynamic construction
Abstract:

Multiple heads decoding accelerates the inference of Large Language Models (LLMs) by predicting the next several tokens simultaneously. It generates and verifies multiple candidate sequences in parallel via tree attention with a fixed structure. In this paper, we replace the fixed tree attention with dynamic tree attention for multiple heads decoding, specifically in the context of MEDUSA. We propose a simple, low-complexity strategy to generate candidates and construct the dynamic tree structure. Preliminary experiments show that the proposed method improves the decoding efficiency of multiple heads decoding for LLMs while maintaining generation quality. These results demonstrate the potential for improving candidate generation in multiple heads decoding.
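
To make the idea concrete, the following is a minimal sketch of one plausible way a dynamic tree could be built from the per-head predictions; it is an illustration under stated assumptions, not the authors' implementation. The helper build_dynamic_tree, its head_probs input (one softmax distribution per MEDUSA head), and the budgets num_candidates and topk_per_head are hypothetical names introduced here. The sketch ranks candidate continuations by their joint (product) probability across heads, then merges the selected candidates into a prefix tree whose parent pointers define a sparse tree-attention mask.

    import heapq
    import numpy as np

    def build_dynamic_tree(head_probs, num_candidates=16, topk_per_head=8):
        """Hypothetical sketch: pick top candidate continuations by joint
        probability across decoding heads and merge them into a prefix tree.

        head_probs: list of K arrays, each of shape (vocab,), the softmax
        output of head k predicting the token at position t+k.
        Returns (nodes, parent): nodes maps each token prefix to a node
        index; parent[i] is the parent node index (-1 for the root).
        """
        # Restrict each head to its top tokens to keep the search cheap
        # (assumes vocab size > topk_per_head).
        tops = []
        for p in head_probs:
            idx = np.argpartition(p, -topk_per_head)[-topk_per_head:]
            tops.append([(int(i), float(np.log(p[i]))) for i in idx])

        # Best-first expansion over prefixes, scored by summed log-probs.
        # A prefix shorter than K is itself a valid tree node/candidate.
        heap = [(0.0, ())]  # (negative joint log-prob, token prefix)
        selected = []
        while heap and len(selected) < num_candidates:
            neg_score, prefix = heapq.heappop(heap)
            if prefix:
                selected.append(prefix)
            depth = len(prefix)
            if depth < len(tops):
                for tok, logp in tops[depth]:
                    heapq.heappush(heap, (neg_score - logp, prefix + (tok,)))

        # Merge candidates into a prefix tree; shared prefixes become
        # shared nodes, which is what tree attention verifies in parallel.
        nodes, parent = {}, []
        for cand in selected:
            for d in range(1, len(cand) + 1):
                key = cand[:d]
                if key not in nodes:
                    nodes[key] = len(parent)
                    parent.append(nodes.get(key[:-1], -1))  # -1 = root
        return nodes, parent

Because appending a token can only lower a prefix's joint probability, the best-first expansion pops candidates in exact score order, so the top candidates are found without enumerating the full Cartesian product that a fixed tree structure must budget for in advance.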

Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 69