EQUIP: EQUivariant preserving In-Place updates for Efficient Token Pruning

ACL ARR 2026 January Submission 10749 Authors

06 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: LLM inference, Token Pruning, KV cache management, RoPE computation
Abstract: Token pruning has emerged as a pivotal technique for large language models (LLMs), driven by the need to enhance inference efficiency while preserving accuracy, especially at large sequence lengths. However, the eviction operation in token-pruning methods leaves “holes” in the KV tensors, posing two major challenges: (1) the shift operation required to make the KV tensor contiguous incurs significant copy overhead; and (2) the changes in position indices caused by token eviction increase the computational cost of Rotary Positional Encoding (RoPE). To address these issues, we introduce EQUIP, an EQUivariant-preserving In-Place token update mechanism that ensures the equivariance property of the operations performed in the attention computation. EQUIP offers two fundamental advantages. First, it combines an eviction and the subsequent token insertion into a single in-place replacement, significantly reducing KV-cache copy overhead. Second, EQUIP reduces recomputation of rotation operations through a combination of in-place updates, caching, and a re-indexing strategy. Together, these optimizations enable EQUIP with StreamingLLM to achieve geomean speedups of 1.62× on CPU (1.47× on GPU) over StreamingLLM, and 3.45× on CPU (1.86× on GPU) over H2O. EQUIP with Paged Attention achieves speedups of 4.18× on CPU (2.61× on GPU) over auto-regressive baselines. EQUIP preserves the same model accuracy as the baseline pruning methods.
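The contrast the abstract draws between shift-based compaction and in-place replacement can be illustrated with a small sketch. This is a hypothetical toy model (not the authors' implementation): the cache is a plain list of per-token key vectors, and the function names and RoPE re-indexing step are assumptions for illustration only.

```python
# Hypothetical sketch (not the authors' code): contrast shift-based
# compaction after token eviction with an EQUIP-style in-place update.
# The "cache" is a toy list of per-token key vectors.

def evict_and_append_shift(cache, evict_idx, new_k):
    """Baseline: deleting a token leaves a hole; restoring contiguity
    copies the tail of the cache, then appends the incoming token."""
    return cache[:evict_idx] + cache[evict_idx + 1:] + [new_k]  # O(n) copy

def evict_and_insert_inplace(cache, evict_idx, new_k):
    """EQUIP-style idea: overwrite the evicted slot with the incoming
    token; a position re-index (not shown) keeps RoPE consistent."""
    cache[evict_idx] = new_k  # O(1) write, no tail movement
    return cache

cache = [[float(t)] * 4 for t in range(8)]   # 8 tokens, head_dim = 4
new_k = [-1.0] * 4                           # key vector of the new token

shifted = evict_and_append_shift(list(cache), evict_idx=2, new_k=new_k)
inplace = evict_and_insert_inplace(list(cache), evict_idx=2, new_k=new_k)

# Both caches hold the same multiset of rows, just in different orders,
# which is why the attention computation must be equivariant to that order.
assert sorted(map(tuple, shifted)) == sorted(map(tuple, inplace))
```

The in-place variant avoids the tail copy entirely, but only remains correct because (as the abstract notes) the attention operations are made equivariant to the physical ordering of cache slots.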
Paper Type: Long
Research Area: LLM Efficiency
Research Area Keywords: LLM inference, Token Pruning, KV cache, RoPE
Contribution Types: Approaches to low-resource settings, Approaches to low-compute settings (efficiency)
Languages Studied: English
Submission Number: 10749