Keywords: sequence model, long-context sequence modeling, Transformer, softmax attention, linear attention, RNN, language model
TL;DR: We propose the Forgetting Transformer, a Transformer variant with a forget gate.
Abstract: An essential component of modern recurrent sequence models is the forget gate. While Transformers do not have an explicit recurrent form, we show that a forget gate can be naturally incorporated into Transformers by down-weighting the unnormalized attention scores in a data-dependent way. We name the resulting model the Forgetting Transformer. We show that the Forgetting Transformer significantly outperforms the standard Transformer on long-context language modeling and downstream tasks. Moreover, the Forgetting Transformer does not require any position embeddings and generalizes beyond the training context length. Several analyses, including the needle-in-the-haystack experiment, show that the Forgetting Transformer also retains the standard Transformer's superior long-context capabilities over recurrent sequence models such as Mamba-2, HGRN2, and DeltaNet.
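The abstract describes the core mechanism only at a high level: a data-dependent forget gate that down-weights unnormalized attention scores. Below is a minimal NumPy sketch of one plausible realization of this idea. The function name `forgetting_attention`, the scalar per-timestep gate `f_logits`, and the additive log-space bias `cum[i] - cum[j]` are illustrative assumptions, not the authors' exact formulation; the sketch only shows how a sigmoid forget gate can enter softmax attention as a cumulative decay bias on the logits.

```python
import numpy as np

def forgetting_attention(q, k, v, f_logits):
    """Causal softmax attention with a hypothetical scalar forget gate.

    q, k, v: (T, d) arrays; f_logits: (T,) pre-activation gate values.
    """
    T, d = q.shape
    # Forget gate in log space: log sigmoid(f_logits), elementwise.
    log_f = -np.log1p(np.exp(-f_logits))
    # Prefix sums of log gates; bias[i, j] = sum_{l=j+1}^{i} log f_l.
    cum = np.cumsum(log_f)
    bias = cum[:, None] - cum[None, :]
    # Standard scaled dot-product logits, down-weighted by the cumulative
    # forget bias (equivalent to multiplying unnormalized scores by
    # the product of gates between positions j and i).
    scores = q @ k.T / np.sqrt(d) + bias
    # Causal mask: position i attends only to positions j <= i.
    mask = np.tril(np.ones((T, T), dtype=bool))
    scores = np.where(mask, scores, -np.inf)
    # Numerically stable softmax over each row.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

When the gate saturates at 1 (large `f_logits`), the bias vanishes and this reduces exactly to ordinary causal attention; smaller gate values progressively suppress distant positions, which is consistent with the abstract's claim that the mechanism can replace position embeddings.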
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 13091