Keywords: transformers, long-term memory, sequential processing, lifelong learning
Abstract: Developing sequential models traditionally involves two stages: training and application. Retention of information acquired after training (at application time) is architecturally limited by the size of the model's context window (in the case of transformers) or by the practical difficulties associated with long sequences (in the case of RNNs). In this paper, we propose a novel transformer-based Updater-Extractor architecture that can work with sequences of arbitrary length and refine its long-term knowledge about the world based on inputs received at application time. We explicitly train the model to incorporate incoming information into its world-state representation, obtaining strong inductive generalization and the ability to handle extremely long-range dependencies. We propose a one-step training procedure that makes such training feasible, and prove a lemma that provides theoretical justification for it. Empirically, we investigate the model's performance on a variety of tasks: we use two new simulated tasks to study its ability to handle extremely long-range dependencies, and we demonstrate competitive performance on the challenging Pathfinder problem using vanilla attention.
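The abstract describes an architecture with a persistent world-state that an updater refines chunk by chunk and an extractor queries. Below is a minimal illustrative sketch of that loop; the class name, module choices, and shapes are assumptions for exposition, not the authors' implementation:

```python
import torch
import torch.nn as nn

class UpdaterExtractor(nn.Module):
    """Hypothetical sketch: a fixed-size world-state refined chunk by chunk."""

    def __init__(self, d_model=256, n_state=32, n_heads=4, n_layers=2):
        super().__init__()
        # Learned initial world-state representation (n_state slots).
        self.init_state = nn.Parameter(torch.randn(n_state, d_model))
        # Updater: attends over [state; chunk], returns a refreshed state.
        self.updater = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            n_layers,
        )
        # Extractor: answers a query by attending to the state alone.
        self.extractor = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True),
            n_layers,
        )

    def update(self, state, chunk):
        # Jointly process state and incoming chunk; keep only the state
        # positions, so the state size stays constant as the sequence grows.
        x = torch.cat([state, chunk], dim=1)
        return self.updater(x)[:, : state.size(1)]

    def extract(self, state, query):
        return self.extractor(query, state)

model = UpdaterExtractor()
state = model.init_state.unsqueeze(0)        # batch of 1
for chunk in torch.randn(10, 1, 16, 256):    # stream of 10 input chunks
    state = model.update(state, chunk)       # refine long-term knowledge
answer = model.extract(state, torch.randn(1, 4, 256))
```

Because the state is a fixed number of slots, per-chunk cost is constant, which is what lets the model handle sequences of arbitrary length at application time.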
One-sentence Summary: Proposing a theoretically and practically justified way to introduce persistent world state representations into transformer architectures.
Supplementary Material: zip