Keywords: sequence learning, recurrent learning, streaming learning, language models
Abstract: Sequence data are inherently dependent, yet sequence learners (e.g., language models) are often trained as if samples were independent and identically distributed (IID) by segmenting long streams into short, shuffled chunks, breaking natural continuity and undermining long-range credit assignment. We formalize multi-stream sequence learning, a continuity-preserving training framework that presents multiple streams in their natural order; this setting has often been conflated with particular solution methods and remains underexplored. To support this paradigm, we propose Memora, a recurrent-only architecture whose persistent hidden states make it better suited to sequence learning than architectures trained with IID chunking. Memora is built around our Gated Linear Recurrent Unit (GLRU), a lightweight unit designed for efficient parallel training and robust temporal reasoning. It learns effectively on long byte-level sequences and remains reliable even in the strict streaming setting, where data arrive online one byte at a time. Our experiments show that continuity-preserving training outperforms IID chunking, underscoring the importance of continuity in sequence learning.
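The abstract describes a recurrent unit whose hidden state persists across consecutive chunks of a stream rather than being reset per chunk. The GLRU equations are not given here, so the sketch below is only an illustrative stand-in: a generic element-wise gated linear recurrence with an assumed update rule h_t = g_t * h_{t-1} + (1 - g_t) * W_x x_t, g_t = sigmoid(W_g x_t), where the class name `GatedLinearRecurrence` and all parameter names are hypothetical.

```python
# Hypothetical sketch of a gated linear recurrence with a persistent hidden
# state, in the spirit of the unit the abstract describes. NOT the paper's
# GLRU: the update rule, names, and shapes below are assumptions.
import torch
import torch.nn as nn


class GatedLinearRecurrence(nn.Module):
    """Element-wise gated linear recurrence; hidden state is carried across calls."""

    def __init__(self, input_dim: int, hidden_dim: int):
        super().__init__()
        self.in_proj = nn.Linear(input_dim, hidden_dim)    # candidate input
        self.gate_proj = nn.Linear(input_dim, hidden_dim)  # retain/forget gate

    def forward(self, x: torch.Tensor, h: torch.Tensor | None = None):
        # x: (batch, time, input_dim); h: (batch, hidden_dim) from the previous chunk
        batch, time, _ = x.shape
        if h is None:
            h = x.new_zeros(batch, self.in_proj.out_features)
        cand = self.in_proj(x)                   # (batch, time, hidden_dim)
        gate = torch.sigmoid(self.gate_proj(x))  # (batch, time, hidden_dim)
        outputs = []
        for t in range(time):                    # sequential scan, for clarity
            h = gate[:, t] * h + (1.0 - gate[:, t]) * cand[:, t]
            outputs.append(h)
        return torch.stack(outputs, dim=1), h    # return state for the next chunk


# Continuity-preserving usage: feed consecutive chunks of one stream and carry
# the state forward, instead of resetting it per chunk as IID training would.
if __name__ == "__main__":
    rnn = GatedLinearRecurrence(input_dim=256, hidden_dim=128)
    state = None
    for chunk in torch.randn(4, 3, 32, 256).unbind(0):  # 4 consecutive chunks
        out, state = rnn(chunk, state)
```

The loop over time is written sequentially for readability; because the recurrence is linear and element-wise in h, the same computation could in principle be parallelized over time with an associative scan, which is presumably what "efficient parallel training" refers to.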
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 22149