Abstract: We introduce MANAR, a linear-time attention layer that can directly inherit weights from a pretrained Transformer's multi-head attention (MHA), a property that distinguishes it from existing linear-time alternatives such as Mamba, RetNet, and Linear Attention, which require training from scratch and therefore forfeit access to the representational capital accumulated in large pretrained Transformers. MANAR augments MHA with a trainable external memory and a constant-size Abstract Conceptual Representation (ACR), a design inspired by the global-workspace bottleneck described in cognitive models of perception. The architecture follows a two-stage logic: (i) an integration phase, in which retrieved memory concepts are combined with the input sequence to form the ACR, a compact global state of the input; and (ii) a broadcasting phase, in which the ACR, together with a local context window, informs the contextualization of each token, replacing all-to-all attention. Routing global information through a constant-sized ACR yields strictly linear time and memory complexity when the local context window is fixed. Because MANAR preserves the semantic roles of the standard MHA projections, knowledge transfer from pretrained Transformers reduces to a direct weight-copy, and we show that transferred models recover and then exceed the accuracy of their sources at a fraction of the from-scratch training budget. MANAR also enables non-convex contextualization: outputs can lie outside the convex hull of the input value vectors, a property we measure empirically and that quadratic softmax attention does not exhibit. Across language, vision, and speech, MANAR is competitive with strong baselines (GLUE 85.1, ImageNet-1K 83.9% top-1, LibriSpeech 2.7% / 6.4% WER) while delivering up to 14.8x single-layer latency reduction and 9.3x peak GPU memory reduction at 4,096 tokens versus quadratic MHA.
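The two-stage routing described in the abstract can be sketched, purely for illustration, as a single-head PyTorch module. All names here (`MANARSketch`, `num_concepts`) and the specific attention forms are our assumptions rather than the paper's definitions; the local context window and the multi-head split are omitted for brevity.

```python
import torch
import torch.nn as nn


class MANARSketch(nn.Module):
    """Illustrative single-head sketch of the two-stage layer described above.

    Stage 1 (integration): a constant number of learned concept slots attend
    over the input sequence to form the Abstract Conceptual Representation
    (ACR), a fixed-size global summary.
    Stage 2 (broadcasting): each token attends to the ACR (and, in the full
    model, a fixed local window) instead of to all other tokens, so cost is
    linear in sequence length. All names and shapes are assumptions.
    """

    def __init__(self, d_model: int, num_concepts: int = 16):
        super().__init__()
        self.scale = d_model ** -0.5
        # Standard MHA-style projections; in principle these are the weights
        # that could be copied directly from a pretrained Transformer layer.
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.o_proj = nn.Linear(d_model, d_model)
        # Trainable external memory: a constant number of concept slots.
        self.concepts = nn.Parameter(0.02 * torch.randn(num_concepts, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        B = x.size(0)

        # Stage 1: integration. Concept slots query the tokens; the result is
        # the ACR of shape (B, num_concepts, d_model), independent of seq_len.
        q_c = self.q_proj(self.concepts).unsqueeze(0).expand(B, -1, -1)
        k_x = self.k_proj(x)
        attn_c = torch.softmax(q_c @ k_x.transpose(1, 2) * self.scale, dim=-1)
        acr = attn_c @ x

        # Stage 2: broadcasting. Each token queries only the constant-size ACR
        # (the local window is omitted here), so total cost stays O(seq_len).
        q_x = self.q_proj(x)
        k_a = self.k_proj(acr)
        v_a = self.v_proj(acr)
        attn_x = torch.softmax(q_x @ k_a.transpose(1, 2) * self.scale, dim=-1)
        return self.o_proj(attn_x @ v_a)
```

With a fixed number of concept slots, both stages cost on the order of seq_len times num_concepts, which is where the claimed linear time and memory come from; the actual layer additionally restricts each token to a local window and retains the multi-head structure of the copied MHA weights.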
Submission Type: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Joao_Sacramento1
Submission Number: 8760