DeFT: Flash Tree-attention with IO-Awareness for Efficient Tree-search-based LLM Inference

Published: 05 Mar 2024, Last Modified: 12 May 2024 | ICLR 2024 AGI Workshop Oral | CC BY 4.0
Keywords: LLM Inference, Tree-based Decoding, Memory Efficiency, Tree Attention
TL;DR: We propose DeFT, an IO-aware tree attention algorithm that optimizes attention calculation over sequence-granular decoding trees, improving efficiency and reducing IO overhead.
Abstract: Decoding with tree search can greatly enhance the inference quality of transformer-based Large Language Models (LLMs). Guided by a search signal, it explores a tree of LLM outputs for the best root-to-leaf path, improving controllability, reasoning ability, alignment, and more. However, current tree decoding strategies and their inference systems are poorly matched, leading to redundant computation, memory footprint, and memory access, and thus inefficient inference. To address this issue, we propose DeFT, an IO-aware tree attention algorithm that performs memory-efficient attention calculation with a low memory footprint in two stages: (1) QKV Preparation: a KV-Guided Tree Split strategy groups QKV to keep GPU utilization high and to minimize reads/writes of the KV cache between GPU global memory and on-chip shared memory; (2) Attention Calculation: partial attention is computed for each QKV group in a fused kernel, and a Tree-Topology-Aware Global Reduction then combines the partial results into the final attention. By reducing KV cache IO by 3.6-4.5x, along with an additional reduction in IO for QK^T and Softmax equal to 25% of the total KV cache IO, DeFT achieves a 1.7-2.4x end-to-end latency speedup over SOTA attention algorithms on two practical reasoning tasks.
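To make the two-stage formulation concrete, here is a minimal NumPy sketch of the underlying math only: per-group partial attention that keeps its softmax statistics, followed by an exact log-sum-exp merge across KV groups. This is not the paper's fused GPU kernel or its KV-Guided Tree Split; the function names (`partial_attention`, `global_reduce`) and the toy shapes are our own illustrative assumptions.

```python
import numpy as np

def partial_attention(q, k, v):
    # Attention over one KV group: return the unnormalized output together
    # with the softmax statistics (row max m, exp-sum l) needed to merge later.
    s = q @ k.T / np.sqrt(q.shape[-1])          # (n_q, n_kv) scaled scores
    m = s.max(axis=-1, keepdims=True)           # per-row max for numerical stability
    p = np.exp(s - m)                           # unnormalized probabilities
    l = p.sum(axis=-1, keepdims=True)           # partial softmax denominator
    return p @ v, m, l                          # unnormalized partial output, stats

def global_reduce(partials):
    # Log-sum-exp style merge of partial attentions from several KV groups,
    # illustrating the idea behind a tree-topology-aware global reduction.
    o_acc, m_acc, l_acc = partials[0]
    for o, m, l in partials[1:]:
        m_new = np.maximum(m_acc, m)
        a, b = np.exp(m_acc - m_new), np.exp(m - m_new)
        o_acc, l_acc, m_acc = o_acc * a + o * b, l_acc * a + l * b, m_new
    return o_acc / l_acc

# Toy decoding tree: one query token attends to a shared-prefix KV group
# and to its own branch's KV group (hypothetical sizes).
rng = np.random.default_rng(0)
d = 8
q = rng.standard_normal((1, d))
kv_groups = [(rng.standard_normal((16, d)), rng.standard_normal((16, d))),  # shared prefix
             (rng.standard_normal((4, d)),  rng.standard_normal((4, d)))]   # branch suffix

out = global_reduce([partial_attention(q, k, v) for k, v in kv_groups])

# Sanity check: identical to ordinary softmax attention over the concatenated KV.
k_full = np.vstack([k for k, _ in kv_groups])
v_full = np.vstack([v for _, v in kv_groups])
s = q @ k_full.T / np.sqrt(d)
p = np.exp(s - s.max(axis=-1, keepdims=True))
ref = (p / p.sum(axis=-1, keepdims=True)) @ v_full
assert np.allclose(out, ref)
```

The sketch only shows why splitting the KV cache into groups does not change the attention result; the IO savings claimed in the abstract come from how the actual kernel schedules these groups between GPU global memory and on-chip shared memory, which is not modeled here.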
Submission Number: 40