CLAA: Cross-Layer Attention Aggregation for Accelerating LLM Prefill

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: LLM inference acceleration, prefill optimization, token ranking, attention aggregation, long-context processing, KV cache compression, oracle
TL;DR: We use an answer-informed oracle to show that Cross-Layer Attention Aggregation (CLAA) produces more stable token rankings for accelerating LLM prefill on long contexts.
Abstract: The prefill stage in long-context LLM inference remains a computational bottleneck. Recent token-ranking heuristics accelerate inference by selectively processing a subset of semantically relevant tokens. However, existing methods suffer from unstable token-importance estimation, with rankings often varying significantly between layers. Evaluating token-ranking quality independently of heuristic-specific architectures is challenging. To address this, we introduce an Answer-Informed Oracle, which defines ground-truth token importance by aggregating attention from generated answers back to the prompt. Using this oracle, we find that existing heuristics (GemFilter and FastKV) exhibit substantial instability across layers, motivating our proposed heuristic, Cross-Layer Attention Aggregation (CLAA). CLAA robustly aggregates token scores across multiple consecutive layers, significantly improving ranking stability and achieving new state-of-the-art performance. On LongBench, CLAA reduces Time-to-First-Token (TTFT) by up to 39\% compared to the Full KV Cache baseline. At a similar level of task accuracy, CLAA provides this speedup while being over 10\% faster than the prior state of the art, FastKV, demonstrating a superior accuracy-speed tradeoff.
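The core idea described in the abstract — scoring context tokens by attention mass aggregated over a window of consecutive layers rather than a single layer — can be sketched as follows. This is a minimal illustrative reading of CLAA, not the paper's actual implementation: the function name `claa_scores`, the sum-over-heads reduction, and the mean over the layer window are assumptions for exposition.

```python
import numpy as np

def claa_scores(attn: np.ndarray, layer_window: tuple[int, int]) -> np.ndarray:
    """Rank context tokens by cross-layer aggregated attention.

    attn: array of shape [num_layers, num_heads, q_len, ctx_len]
          holding attention weights from queries to context tokens.
    layer_window: (start, end) half-open range of consecutive layers
          to aggregate over (a hypothetical parameterization).

    Returns a score per context token; higher = more important.
    """
    # Total attention mass each context token receives, per layer
    # (summing over heads and query positions).
    per_layer = attn.sum(axis=(1, 2))          # [num_layers, ctx_len]
    start, end = layer_window
    # Averaging across a window of layers smooths out the
    # per-layer instability the abstract describes.
    return per_layer[start:end].mean(axis=0)   # [ctx_len]

# Example: keep the top-k tokens under the aggregated ranking.
def top_k_tokens(attn: np.ndarray, layer_window: tuple[int, int], k: int) -> np.ndarray:
    scores = claa_scores(attn, layer_window)
    return np.sort(np.argsort(scores)[-k:])    # indices in original order
```

Under this reading, a single-layer heuristic corresponds to a window of width one, so the layer window is the knob trading ranking stability against how early in the network the selection can be made.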
Primary Area: generative models
Submission Number: 11531