Keywords: Efficient ML; Efficient Computing; Long Sequence Modeling
Abstract: As long-context language modeling becomes increasingly important, the cost of maintaining and attending to large Key/Value (KV) caches grows rapidly, becoming a major bottleneck in both training and inference. While prior works such as Multi-Query Attention (MQA) and Multi-head Latent Attention (MLA) reduce memory by sharing or compressing KV features, they often trade off representation quality or incur runtime overhead. We propose Memory-Keyed Attention (MKA), a hierarchical attention mechanism that integrates multi-level KV caches (local, session, and long-term) and learns to route attention across them dynamically. We further introduce FastMKA, a broadcast-routed variant that fuses memory sources before attention computation for enhanced efficiency. Experiments across different sequence lengths show that MKA improves perplexity over standard Multi-Head Attention (MHA) and MLA, while FastMKA achieves comparable accuracy to MLA with up to 4$\times$ faster training and 40\% lower evaluation latency. These results highlight MKA as a practical and extensible framework for efficient long-context attention.
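To make the hierarchical-cache idea concrete, the following is a minimal PyTorch-style sketch of attention over multiple KV memories with a learned per-query router. It is an illustrative assumption, not the authors' implementation: the class name `HierarchicalKVAttention`, the per-level K/V projections, and the softmax router over three memory levels are all hypothetical choices made only to visualize the mechanism described in the abstract.

```python
# Illustrative sketch only: every class, shape, and routing choice below is an
# assumption used to visualize attending over multiple KV caches (local, session,
# long-term) with a learned router; it is not the paper's MKA/FastMKA code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HierarchicalKVAttention(nn.Module):
    """Hypothetical multi-level KV attention with a learned per-query router."""

    def __init__(self, d_model: int, n_heads: int, n_levels: int = 3):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head, self.n_levels = n_heads, d_model // n_heads, n_levels
        self.q_proj = nn.Linear(d_model, d_model)
        # One K/V projection per memory level (e.g., local, session, long-term).
        self.k_proj = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_levels))
        self.v_proj = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_levels))
        self.router = nn.Linear(d_model, n_levels)  # per-query softmax routing over levels
        self.out_proj = nn.Linear(d_model, d_model)

    def _split(self, x):
        b, t, _ = x.shape
        return x.view(b, t, self.n_heads, self.d_head).transpose(1, 2)  # (B, H, T, Dh)

    def forward(self, x, memories):
        # x: (B, q_len, d_model); memories: list of n_levels tensors (B, mem_len_i, d_model)
        q = self._split(self.q_proj(x))
        route = F.softmax(self.router(x), dim=-1)            # (B, q_len, n_levels)
        outputs = []
        for i, mem in enumerate(memories):
            k = self._split(self.k_proj[i](mem))
            v = self._split(self.v_proj[i](mem))
            outputs.append(F.scaled_dot_product_attention(q, k, v))  # attend to level i
        o = torch.stack(outputs, dim=-1)                     # (B, H, q_len, Dh, n_levels)
        w = route.unsqueeze(1).unsqueeze(3)                  # (B, 1, q_len, 1, n_levels)
        o = (o * w).sum(dim=-1)                              # route-weighted mix of levels
        o = o.transpose(1, 2).reshape(x.shape)               # back to (B, q_len, d_model)
        return self.out_proj(o)


# Toy usage with local, session, and long-term memories of increasing length.
attn = HierarchicalKVAttention(d_model=64, n_heads=4)
x = torch.randn(2, 8, 64)
mems = [torch.randn(2, 8, 64), torch.randn(2, 32, 64), torch.randn(2, 128, 64)]
print(attn(x, mems).shape)  # torch.Size([2, 8, 64])
```

Under this reading, a FastMKA-style variant would fuse the memory levels (e.g., by route-weighting K/V before the score computation) so that only a single attention pass is needed; that fusion step is likewise an assumption about the broadcast-routed design, not a confirmed detail.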
Submission Number: 1