Keywords: Fast weights; Test-time training; Sparse memory
TL;DR: FwPKM equips language models with a sparse fast-weight episodic memory that can be updated online, enabling efficient long-context retrieval and rapid continual adaptation without retraining all model weights.
Abstract: Foundation models need mechanisms for rapid continual adaptation without repeatedly updating all slow weights. We introduce Fast-weight Product Key Memory (FwPKM), a sparse fast-weight memory layer for language modeling. FwPKM transforms Product Key Memory from a static slow-weight module into an online-updated episodic memory: it keeps the sparse PKM retrieval process, but updates activated key-value parameters at both training and inference time using chunk-level gradient descent on a local memory-rewrite objective. This performs Test-Time Training (TTT)-style updates over a sparse memory, enabling many context-specific associations to be memorized and retrieved with fixed per-token compute. Experiments show that FwPKM complements standard slow-weight modules, improves long-context perplexity, and generalizes to 128K-token Needle-in-a-Haystack contexts despite being trained on only 4K-token sequences. In online domain adaptation, FwPKM adapts quickly but exposes retention challenges, motivating future work on memory consolidation.
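The mechanism summarized in the abstract (sparse product-key retrieval followed by a chunk-level gradient step on the activated memory slots) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `pkm_topk`, `read`, and `write_chunk`, the squared-error rewrite objective, the softmax weighting, and the choice to update only value rows (keys stay fixed here) are all assumptions made for illustration.

```python
import numpy as np

def pkm_topk(q, K1, K2, k):
    """Product-key retrieval: top-k of the n*n implicit keys via two n-row sub-key tables."""
    d_half = q.shape[0] // 2
    s1 = K1 @ q[:d_half]                      # scores against first sub-key table, shape (n,)
    s2 = K2 @ q[d_half:]                      # scores against second sub-key table, shape (n,)
    i1 = np.argsort(-s1)[:k]                  # best k rows in each table
    i2 = np.argsort(-s2)[:k]
    cand = s1[i1][:, None] + s2[i2][None, :]  # k*k candidate scores (sum of sub-scores)
    flat = np.argsort(-cand.ravel())[:k]
    r, c = np.unravel_index(flat, cand.shape)
    idx = i1[r] * K2.shape[0] + i2[c]         # flat slot index into the value table
    scores = cand[r, c]
    w = np.exp(scores - scores.max())         # softmax over the selected slots (assumption)
    return idx, w / w.sum()

def read(q, K1, K2, V, k):
    """Sparse read: weighted sum of the k activated value rows."""
    idx, w = pkm_topk(q, K1, K2, k)
    return w @ V[idx]

def write_chunk(Q, T, K1, K2, V, k, lr):
    """One gradient step on 0.5 * sum ||read(q) - t||^2 over a chunk.

    Assumption: only activated value rows are updated; V is modified in place.
    A dense gradient buffer is used for brevity, although only <= len(Q)*k rows are nonzero.
    """
    grad = np.zeros_like(V)
    for q, t in zip(Q, T):
        idx, w = pkm_topk(q, K1, K2, k)
        err = w @ V[idx] - t                              # residual of the current read
        np.add.at(grad, idx, w[:, None] * err[None, :])   # d(loss)/d(V[idx_j]) = w_j * err
    V -= lr * grad
    return V
```

In this sketch each token touches at most `k` value rows, so the per-token cost of a read or write does not grow with the number of memory slots, which is the fixed per-token compute property the abstract refers to. Calling `write_chunk` on a chunk of (query, target) pairs and then `read` on the same queries should return outputs closer to the targets than before the write.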
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 41