Yet Another Scaling Axis with Some Free Lunch: Enlarging Token-indexed Parameters

ICLR 2026 Conference Submission6692 Authors

16 Sept 2025 (modified: 03 Dec 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: LLM, compute-efficient, residual modulation, layer embedding, MoE, scaling law
Abstract: The scaling laws of large language models have driven remarkable progress, yet they reveal a fundamental bottleneck: performance gains from adding parameters diminish while computational costs grow near-linearly. To overcome this limitation, we introduce a new scaling axis—token-indexed parameters—that decouples model capacity from computational cost. We present ReToken and Mixture of ReToken (MoRT), which augment transformer layers with token-specific modulation vectors retrieved from learned embedding tables. These vectors are applied through table lookups and element-wise operations, adding negligible FLOPs compared to the backbone's $O(d^2)$ per-token complexity. Across dense and MoE backbones (190M–3B parameters), our method consistently reduces training loss and substantially improves downstream task accuracy while leaving inference FLOPs and latency unchanged. Our scaling law analysis reveals that MoRT introduces a multiplicative improvement factor to the loss function, shifting the quality–compute Pareto frontier and achieving equivalent model quality with 20–30% less compute than baseline MoE models. These results establish token-indexed parameters as an effective scaling dimension for continued LLM advancement without proportional compute increases.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 6692
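The abstract describes the core mechanism as per-token modulation vectors fetched from learned embedding tables and applied element-wise inside each transformer layer. The sketch below illustrates one plausible reading of that idea; the module names (`ReTokenModulation`, `ModulatedBlock`), the scale/shift parameterization, and the exact insertion point on the residual branch are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of token-indexed modulation, assuming the vectors gate the
# residual-branch update element-wise. All names and the placement are
# assumptions inferred from the abstract, not the paper's actual code.
import torch
import torch.nn as nn


class ReTokenModulation(nn.Module):
    """Per-token, per-layer modulation vectors fetched by table lookup."""

    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        # One d_model-sized vector per vocabulary item; zero-initialized so the
        # module starts as an identity transform (scale ~= 1, shift = 0).
        self.scale = nn.Embedding(vocab_size, d_model)
        self.shift = nn.Embedding(vocab_size, d_model)
        nn.init.zeros_(self.scale.weight)
        nn.init.zeros_(self.shift.weight)

    def forward(self, hidden: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, d_model); token_ids: (batch, seq)
        # A table lookup plus element-wise ops costs O(d) per token,
        # negligible next to the backbone's O(d^2) matmuls.
        return hidden * (1.0 + self.scale(token_ids)) + self.shift(token_ids)


class ModulatedBlock(nn.Module):
    """Wraps an existing transformer block with token-indexed modulation."""

    def __init__(self, block: nn.Module, vocab_size: int, d_model: int):
        super().__init__()
        self.block = block
        self.mod = ReTokenModulation(vocab_size, d_model)

    def forward(self, hidden: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        # Modulate the block's output before adding it back to the residual stream.
        return hidden + self.mod(self.block(hidden), token_ids)
```

Because the added parameters are indexed by token id rather than mixed through matrix multiplies, the extra capacity scales with the vocabulary and layer count while the per-token compute stays essentially flat, which is the decoupling of capacity from FLOPs that the abstract claims.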