Keywords: sequence learning, recurrent learning, streaming learning
Abstract: We re-evaluate the suitability of the independent and identically distributed (IID) training paradigm for sequence learning, in which long data streams are segmented into shorter, shuffled chunks, thereby breaking their natural continuity and undermining long-range credit assignment. This paper introduces multi-stream sequence learning, a training framework that presents multiple data streams in their natural order. To support this framework, we propose Memora, a recurrent-only architecture whose persistent hidden states make it more suitable for sequence learning than Transformers. Memora builds on the Gated Linear Recurrent Unit (GLRU), a new lightweight recurrent unit designed for efficient parallel training and robust temporal reasoning, and achieves effective learning on long byte-level sequences. Our experiments on structured and byte-level benchmarks demonstrate that models trained under the multi-stream sequence learning framework consistently outperform standard recurrent and state-space models trained in the standard IID setting, underscoring the importance of preserving continuity in sequence learning.
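To make the contrast with IID chunk shuffling concrete, the sketch below illustrates one plausible way to batch data under the multi-stream framework as described in the abstract: each of several long byte streams is consumed in its natural order, so a stateful recurrent model can carry its hidden state from one batch to the next. The class and parameter names (`MultiStreamLoader`, `chunk_len`) are illustrative assumptions, not the paper's actual API.

```python
# Minimal sketch of multi-stream batching, assuming each "stream" is a long
# byte sequence kept in its natural order. Successive batches continue each
# stream contiguously, so a recurrent model's hidden state can persist across
# batches instead of being reset over shuffled IID chunks.
import numpy as np

class MultiStreamLoader:
    def __init__(self, streams, chunk_len):
        self.streams = [np.asarray(s) for s in streams]  # parallel byte streams
        self.chunk_len = chunk_len
        self.offsets = [0] * len(streams)                # per-stream read positions

    def __iter__(self):
        while all(o + self.chunk_len <= len(s)
                  for o, s in zip(self.offsets, self.streams)):
            # Take the next contiguous chunk from every stream (no shuffling),
            # so batch t+1 starts exactly where batch t ended.
            batch = np.stack([
                s[o:o + self.chunk_len]
                for s, o in zip(self.streams, self.offsets)
            ])
            self.offsets = [o + self.chunk_len for o in self.offsets]
            yield batch  # shape: (num_streams, chunk_len)

# Usage: in a real training loop the recurrent state would be initialized once
# per stream and propagated across consecutive batches.
streams = [np.frombuffer(b"first byte stream, kept in order. " * 8, dtype=np.uint8),
           np.frombuffer(b"second byte stream, also in order." * 8, dtype=np.uint8)]
for batch in MultiStreamLoader(streams, chunk_len=16):
    pass  # e.g. hidden = model.step(batch, hidden) with state carried over
```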
Submission Number: 117