"COMPLEXITY-DEEP: A Language Model Architecture with Mu-Guided Attention and Token-Routed MLP"

TMLR Paper7327 Authors

04 Feb 2026 (modified: 09 Mar 2026) · Under review for TMLR · CC BY 4.0
Abstract: We present COMPLEXITY-DEEP, a language model architecture developed from scratch that introduces three original contributions: (1) Token-Routed MLP, a dynamic per-token routing mechanism inspired by Mixture of Experts but requiring no auxiliary load-balancing loss; (2) Mu-Guided Attention, in which a latent state μ from the previous layer guides the Q, K, and V projections, creating a bidirectional information flow between attention and dynamics; and (3) a PID-style adaptive controller that stabilizes training through dynamic scaling. We provide formal theoretical analysis proving perfect load balance, capacity equivalence with dense models at 1/n of the compute cost, and gradient-driven expert orthogonalization, and we establish connections between Mu-Guidance and predictive coding theory. Our 1.5B-parameter implementation, trained on 33B tokens from FineWeb-Edu, demonstrates the viability of the architecture with stable convergence (loss 3.78, perplexity 43.7). Evaluation on standard benchmarks shows performance consistent with the model size; supervised fine-tuning achieves 30% on MMLU (5 points above random) and 23% on ARC-Challenge.
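The Token-Routed MLP described above can be illustrated with a minimal sketch. The submission later notes the routing is "deterministic modulo routing"; the sketch below assumes this means routing each token to one expert by its sequence position modulo the number of experts, which yields an exactly equal token count per expert (the claimed perfect load balance) with no auxiliary loss. All class and parameter names here are hypothetical, not taken from the paper's code.

```python
import torch
import torch.nn as nn


class TokenRoutedMLP(nn.Module):
    """Hypothetical sketch of a Token-Routed MLP layer.

    Each token is sent to exactly one expert MLP, chosen by a deterministic
    modulo rule on the token's sequence position (an assumption about what
    "deterministic modulo routing" means). Because positions are partitioned
    evenly across experts, every expert processes the same number of tokens,
    so no auxiliary load-balancing loss is needed.
    """

    def __init__(self, d_model: int, d_hidden: int, n_experts: int):
        super().__init__()
        self.n_experts = n_experts
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, d_hidden),
                nn.GELU(),
                nn.Linear(d_hidden, d_model),
            )
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        out = torch.empty_like(x)
        positions = torch.arange(x.size(1), device=x.device)
        for e in range(self.n_experts):
            # Deterministic routing: position p goes to expert p mod n_experts.
            mask = (positions % self.n_experts) == e
            out[:, mask] = self.experts[e](x[:, mask])
        return out
```

Each forward pass runs every expert on a 1/n slice of the tokens, which is how the abstract's "1/n compute cost" relative to a dense MLP of the combined capacity arises under this reading.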
Submission Type: Long submission (more than 12 pages of main content)
Changes Since Last Submission: Added a t-SNE visualization of expert activations (Figure 1) showing functional specialization despite deterministic modulo routing. Added Section 7.2.1 discussing trade-offs between Token-Routed MLP and dynamic Top-k routing (Switch Transformer, Mixtral). Translated all supplementary figure captions and labels from French to English. A dense 1.5B baseline (trained on the same 33B FineWeb-Edu tokens) and proxy-scale training-time ablations (100M–500M) are currently being trained; results will be provided in a follow-up revision.
Assigned Action Editor: ~Tal_Schuster1
Submission Number: 7327