Keywords: Efficient inference, Neural memory, Test-time learning
Abstract: We present MoNe, a lightweight modular neural memory that attaches to any frozen pretrained Transformer to enable long-context inference without retraining. MoNe reads the context in fixed-size segments via test-time learning of fast-weight neural memory networks with layer-localized gradient updates; at inference, the memory generates keys and values from the query tokens alone, without re-reading any context tokens. This two-phase design decouples inference cost from the context length $N$: preprocessing costs $O(N)$, query cost is $O(1)$ in $N$, and peak GPU memory does not grow with $N$. At 128K tokens, MoNe reduces both compute and peak GPU memory by approximately 80\% compared to in-context learning (ICL), with only 6.4\% parameter overhead. MoNe generalizes to context lengths far beyond the backbone's native window, achieving strong performance on needle-in-a-haystack and word extraction benchmarks from RULER, where ICL degrades sharply.
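The two-phase design described in the abstract can be illustrated with a minimal toy sketch. It is not the MoNe architecture: it assumes a single linear fast-weight matrix, a per-token squared-error loss, and one-hot keys, and every name in it (`ToyFastWeightMemory`, `preprocess`, `segment_len`) is hypothetical. It only shows the general pattern of writing the context segment by segment with gradient steps, then answering a query from the memory alone, with a state size that does not depend on the number of context tokens.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


class ToyFastWeightMemory:
    """Toy linear associative memory updated at test time (illustrative only)."""

    def __init__(self, dim, lr=1.0):
        self.dim = dim
        self.lr = lr
        self.W = [[0.0] * dim for _ in range(dim)]  # fast weights, fixed size

    def write_segment(self, keys, values):
        # One gradient step per token on 0.5 * ||W k - v||^2.
        # The gradient with respect to W is the outer product (W k - v) k^T.
        for k, v in zip(keys, values):
            pred = matvec(self.W, k)
            err = [p - t for p, t in zip(pred, v)]
            for i in range(self.dim):
                for j in range(self.dim):
                    self.W[i][j] -= self.lr * err[i] * k[j]

    def read(self, query):
        # Query phase: only the query and the fast weights are used.
        return matvec(self.W, query)

    def num_state_floats(self):
        # State size depends on dim only, not on how many tokens were written.
        return self.dim * self.dim


def preprocess(memory, keys, values, segment_len):
    """Phase 1: a single pass over the context in fixed-size segments."""
    for start in range(0, len(keys), segment_len):
        end = start + segment_len
        memory.write_segment(keys[start:end], values[start:end])


def one_hot(i, dim):
    return [1.0 if j == i else 0.0 for j in range(dim)]


dim = 8
keys = [one_hot(i, dim) for i in range(dim)]
values = [[float(i), float(-i)] + [0.0] * (dim - 2) for i in range(dim)]

memory = ToyFastWeightMemory(dim)
preprocess(memory, keys, values, segment_len=4)  # phase 1: linear in context length
recalled = memory.read(keys[5])                  # phase 2: no context tokens re-read
```

With orthonormal keys and a unit learning rate, each write stores its value exactly, so `recalled` equals `values[5]`. Real keys are not orthogonal, so a practical memory would need a nonlinear network and a tuned learning rate; those choices are outside what this sketch covers.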
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 124