Memorization in Attention-only Transformers
TL;DR: We extend results on Transformer memorization to arbitrary context sizes, improving exact and approximate memorization with stronger theoretical and experimental validation.
Abstract: Recent research has explored the memorization capacity of multi-head attention, but these findings are constrained by unrealistic limitations on the context size. We present a novel proof for language-based Transformers that extends the current hypothesis to arbitrary context sizes. Our approach improves upon the state of the art by achieving more effective exact memorization with an attention layer, while also introducing the concept of approximate memorization of distributions. Through experimental validation, we demonstrate that our proposed bounds more accurately reflect the true memorization capacity of language models, and we provide a precise comparison with prior work.
Submission Number: 1126