Low Rank Experts Enable Specialization In Dense Transformers

18 Sept 2025 (modified: 01 Dec 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: transformers, MoE, LLMs
TL;DR: We present routed low rank experts, a drop-in augmentation for standard Transformer MLPs that improves model performance.
Abstract: We present Low Rank Experts (LoREs), a drop-in augmentation for standard Transformer MLPs that improves model performance. Each MLP hosts a router that selects a token-specific top-$k$ subset from a bank of low-rank matrices. Their combined contribution is injected into the up-projection, yielding a dynamic, per-token rank-$k$ update to the base weight that is executed in parallel with the up-projection via grouped GEMMs. To compare with dense baselines, we match the parameter budget of LoRE-augmented Transformers by shrinking the base expansion factor to offset the router and low-rank expert parameters. Overall FLOPs decrease because the low-rank branch is sparsely activated. LoREs deliver token-level specialization by routing each token to structured experts inside a single dense MLP. We observe consistent quality improvements on benchmark tasks for models with up to 1.6B parameters trained on 400B tokens.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 14437
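
The abstract describes a per-MLP router that picks a top-$k$ subset of low-rank experts and adds their weighted, rank-$k$ contribution to the dense up-projection. Below is a minimal sketch of that forward pass, assuming a PyTorch implementation; the module and parameter names (`LoREMLP`, `num_experts`, `rank`, `top_k`) are illustrative rather than taken from the paper, and the per-token gather stands in for the grouped-GEMM execution mentioned in the abstract.

```python
# Minimal sketch of a LoRE-augmented MLP (assumed PyTorch implementation;
# names are illustrative, not from the paper).
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoREMLP(nn.Module):
    def __init__(self, d_model, d_ff, num_experts=16, rank=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Dense base MLP; its expansion factor would be shrunk to offset the
        # router and expert parameters when matching a dense parameter budget.
        self.up_proj = nn.Linear(d_model, d_ff, bias=False)
        self.down_proj = nn.Linear(d_ff, d_model, bias=False)
        # Router over the bank of low-rank experts.
        self.router = nn.Linear(d_model, num_experts, bias=False)
        # Expert e contributes B_e @ (A_e @ x); B starts at zero so the module
        # initially behaves like the dense up-projection alone.
        self.A = nn.Parameter(torch.randn(num_experts, rank, d_model) * d_model ** -0.5)
        self.B = nn.Parameter(torch.zeros(num_experts, d_ff, rank))

    def forward(self, x):                          # x: (tokens, d_model)
        up = self.up_proj(x)                       # dense up-projection
        weights, idx = self.router(x).topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)       # normalize over chosen experts
        A_sel, B_sel = self.A[idx], self.B[idx]    # gather selected expert factors
        low = torch.einsum('tkrd,td->tkr', A_sel, x)        # project down to rank
        delta = torch.einsum('tkfr,tkr->tkf', B_sel, low)   # back up to d_ff
        up = up + (weights.unsqueeze(-1) * delta).sum(dim=1)  # inject rank-k update
        return self.down_proj(F.gelu(up))
```

For example, `LoREMLP(1024, 4096)(torch.randn(8, 1024))` returns a tensor of the same shape as its input; a production version would batch tokens by selected expert so the low-rank branch runs as grouped GEMMs alongside the base up-projection.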