Latent-Condensed Transformer for Efficient Long Context Modeling

ACL ARR 2026 January Submission 695 Authors

24 Dec 2025 (modified: 20 Mar 2026), ACL ARR 2026 January Submission, CC BY 4.0

Keywords: Efficient Attention, Large Language Models

Abstract: Large language models (LLMs) face significant challenges in processing long contexts due to the linear growth of the key-value (KV) cache and quadratic complexity of self-attention. Existing approaches address these bottlenecks separately: Multi-head Latent Attention (MLA) reduces the KV cache by projecting tokens into a low-dimensional latent space, while sparse attention reduces computation. However, sparse methods cannot operate natively on MLA's compressed latent structure, missing opportunities for joint optimization. In this paper, we propose Latent-Condensed Attention (LCA), which directly condenses context within MLA's latent space, where the representation is disentangled into semantic latent vectors and positional keys. LCA separately aggregates semantic vectors via query-aware pooling and preserves positional keys via anchor selection. This approach jointly reduces both computational cost and KV cache without adding parameters. Theoretically, we prove a length-independent error bound. Experiments show LCA achieves up to 2.5× prefilling speedup and 90% KV cache reduction at 128K context while maintaining competitive performance.
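
The two condensation steps named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the grouping, the scoring function, or the anchor rule, so the fixed-size blocks, the dot-product softmax weights, the highest-weight anchor choice, and the names `condense` and `block_size` are all assumptions made here for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def condense(latents, pos_keys, query, block_size):
    """Condense per-token (latent, positional key) pairs block by block.

    latents:    one semantic latent vector (list of floats) per token
    pos_keys:   one positional key per token (treated as opaque here)
    query:      query vector in the same space as the latents (assumption)
    block_size: tokens per block (hypothetical hyperparameter)
    """
    pooled, anchors = [], []
    for start in range(0, len(latents), block_size):
        block = latents[start:start + block_size]
        # Query-aware pooling: weight each latent by its similarity to the query.
        scores = [sum(q * c for q, c in zip(query, vec)) for vec in block]
        weights = softmax(scores)
        dim = len(block[0])
        pooled.append(
            [sum(w * vec[d] for w, vec in zip(weights, block)) for d in range(dim)]
        )
        # Anchor selection (assumed rule): keep the positional key of the
        # token with the largest pooling weight, unmodified.
        best = max(range(len(block)), key=lambda i: weights[i])
        anchors.append(pos_keys[start + best])
    return pooled, anchors


latents = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
pos_keys = ["p0", "p1", "p2", "p3"]
pooled, anchors = condense(latents, pos_keys, query=[1.0, 0.0], block_size=2)
print(len(pooled), anchors)
```

In this toy setup four cached entries shrink to two, and no new parameters are introduced because the pooling weights come from the query itself. Positional keys are selected rather than averaged, on the assumption that averaging position-encoded keys would blur the position information they carry.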

Paper Type: Long

Research Area: LLM Efficiency

Research Area Keywords: LLM Efficiency, sparse models

Contribution Types: Approaches to low-resource settings, Approaches to low-compute settings (efficiency)

Languages Studied: English

Submission Number: 695
