LiteByte: Efficient and Fast-Adapting MLPs for Online Byte-Level Prediction

Published: 10 Jun 2025, Last Modified: 15 Jul 2025 · MOSS@ICML2025 · CC BY 4.0
Keywords: Byte-Level Modeling, Online Learning, MLP Architecture, Attention-Free, Soft Expert Routing, Autoregressive Prediction
Abstract: Transformer-based architectures have become the de facto standard for sequence modeling, largely due to their scalability and ability to capture long-range dependencies. However, their high computational cost, reliance on long contexts, and limited adaptability under online updates make them less suitable for small-scale or streaming scenarios. In this paper, we revisit MLP-based models for byte-level next-token prediction under fully online training. We propose a simple yet effective architecture, LiteByte, which alternates feedforward layers with soft-shared expert projections and uses neither attention nor recurrence. Each sample is dynamically routed through a learned mixture of compact shared MLPs, enabling adaptive token-wise transformations with minimal overhead. Despite its simplicity, our model converges significantly faster and reaches lower perplexity than Transformer, RNN, and vanilla MLP baselines on Enwik8, Text8, and a curated Dickens corpus, while also achieving lower inference latency and higher throughput. We further argue that the soft expert mechanism introduces a reusable and modular structure that may serve as a lightweight adapter or differentiable controller in broader applications such as LoRA-style fine-tuning or modular agents.
Code: ipynb
Submission Number: 19
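
The soft-expert routing described in the abstract can be pictured with a short sketch. The PyTorch snippet below is an illustrative reconstruction, not the authors' released code (that is in the attached notebook): every name (SoftSharedExpertBlock, LiteByteSketch), the GELU activations, layer norm, residual connection, and all dimensions are assumptions made for readability. It only shows the core idea of routing each position through a learned softmax mixture over a small pool of shared expert MLPs, interleaved with plain feedforward layers.

```python
# Minimal sketch of a soft-shared-expert MLP block, assuming a PyTorch setup.
# All module names, sizes, and layer choices are illustrative, not from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SoftSharedExpertBlock(nn.Module):
    """Routes each position through a learned softmax mixture of shared expert MLPs."""

    def __init__(self, d_model: int, d_expert: int, num_experts: int):
        super().__init__()
        # Router producing per-position mixture weights over the shared experts.
        self.router = nn.Linear(d_model, num_experts)
        # A small pool of compact expert MLPs shared across all positions.
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, d_expert),
                nn.GELU(),
                nn.Linear(d_expert, d_model),
            )
            for _ in range(num_experts)
        )
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        weights = F.softmax(self.router(x), dim=-1)                      # (B, T, E)
        expert_out = torch.stack([e(x) for e in self.experts], dim=-2)   # (B, T, E, D)
        mixed = (weights.unsqueeze(-1) * expert_out).sum(dim=-2)         # (B, T, D)
        return self.norm(x + mixed)  # residual connection (assumed)


class LiteByteSketch(nn.Module):
    """Attention-free byte-level model: byte embedding, alternating feedforward and
    soft-expert blocks, and a projection back to the 256-way byte vocabulary."""

    def __init__(self, d_model: int = 256, d_expert: int = 128,
                 num_experts: int = 4, depth: int = 4):
        super().__init__()
        self.embed = nn.Embedding(256, d_model)
        layers = []
        for _ in range(depth):
            layers.append(nn.Sequential(nn.Linear(d_model, d_model), nn.GELU()))
            layers.append(SoftSharedExpertBlock(d_model, d_expert, num_experts))
        self.layers = nn.Sequential(*layers)
        self.head = nn.Linear(d_model, 256)

    def forward(self, byte_ids: torch.Tensor) -> torch.Tensor:
        # byte_ids: (batch, seq_len) of byte values in [0, 255]
        return self.head(self.layers(self.embed(byte_ids)))
```

Under these assumptions, `LiteByteSketch()(torch.randint(0, 256, (2, 128)))` returns per-position logits over the 256 byte values, matching the byte-level next-token prediction setup described in the abstract.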