"COMPLEXITY-DEEP: A Language Model Architecture with Mu-Guided Attention and Token-Routed MLP"

04 Feb 2026 (modified: 03 Apr 2026) · Under review for TMLR · CC BY 4.0
Abstract: We present COMPLEXITY-DEEP, a language model architecture with three contributions: (1) Token-Routed MLP with Zipf-balanced greedy bin-packing, a deterministic per-token routing scheme that distributes tokens to experts with perfect load balance (1.0000×) and zero token drop, eliminating learned routers and auxiliary losses; (2) Mu-Guided Attention, where a latent state $\mu$ flows forward between layers to bias the K, Q, and V projections, creating an inter-layer communication channel that carries expert-aware context; and (3) Shared Lexical Expert, a dense MLP shared across all tokens that captures universal patterns while the routed experts specialize. We provide formal theoretical analysis proving perfect load balance under Zipf bin-packing, capacity equivalence with dense models ($\mathcal{F}_{TR} = \mathcal{F}_{dense}$) at $1/n$ of the compute cost, and gradient-driven expert orthogonalization. A component ablation at 187M scale (500M tokens, 4 runs) shows that Token-Routed MLP outperforms both the dense baseline ($-0.112$ avg loss) and a Mixtral-style learned-router MoE ($-0.050$). A scaling comparison at 384M (8B tokens, iso-parameter) shows that the loss gap narrows from $+0.28$ to $+0.09$ under AdamW, with progressive expert specialization confirmed by t-SNE analysis. Zero-shot benchmarks (ARC-Easy 43.6%, HellaSwag 28.7%) trail the dense baseline by only 1-2 points. Deployed on vLLM with CUDA graphs, the 384M model achieves 4,900 tok/s on a single RTX PRO 6000 (96GB), with only a 40% throughput reduction despite 2× the parameters. Finally, we propose AdamTR, a routing-aware optimizer extension intended to close the remaining gap.
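The routing mechanism at the center of the abstract lends itself to a short illustration. Below is a minimal sketch of deterministic Zipf-balanced greedy bin-packing routing, assuming token ids are already ordered by descending frequency and that frequencies follow Zipf's law with exponent 1; the function name `build_zipf_routing` and all implementation details are hypothetical, not the paper's code.

```python
import numpy as np

def build_zipf_routing(vocab_size: int, n_experts: int) -> np.ndarray:
    """Build a fixed token-id -> expert-id table by greedy bin-packing.

    Assumption: token id i has relative frequency ~ 1/(i+1) (Zipf, s=1),
    i.e. ids are sorted by descending frequency, as is roughly true for
    frequency-ordered BPE vocabularies.
    """
    freq = 1.0 / np.arange(1, vocab_size + 1)  # Zipf weights, heaviest first
    table = np.empty(vocab_size, dtype=np.int64)
    load = np.zeros(n_experts)
    for token_id in range(vocab_size):
        e = int(np.argmin(load))               # always fill the lightest bin
        table[token_id] = e
        load[e] += freq[token_id]
    return table

table = build_zipf_routing(vocab_size=50_000, n_experts=8)
# Every token id maps to exactly one expert: no learned router, no
# auxiliary balancing loss, and zero token drop by construction.
```

Because the assignment is a fixed lookup table, expected expert loads under this greedy scheme are equal up to the weight of a single token, consistent with the near-perfect (1.0000×) balance reported above.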
Submission Type: Long submission (more than 12 pages of main content)
Changes Since Last Submission: Major revision with 384M scaling results and architectural fixes:
1. Added a 384M iso-parameter scaling comparison (8B tokens, 15,259 steps): Token-Routed (383.5M) vs Dense SwiGLU (384.5M) with identical hyperparameters. The loss gap narrows from +0.28 to a stable +0.09.
2. Zero-shot benchmarks on both models: ARC-Easy (43.6% vs 45.9%), HellaSwag (28.7% vs 30.1%), MMLU (23.0% vs 23.1%).
3. vLLM inference benchmark for the 384M model: 4,900 tok/s sustained throughput with CUDA graphs on an RTX PRO 6000.
4. Fixed a critical bug: builder.py was overriding Zipf routing with a modulo sort_idx. Restored the original loop dispatch matching the supplementary code.
5. Added Theorem 4.1 (modulo balance) as a baseline and upgraded Theorem 4.2 to Zipf bin-packing with a formal bound.
6. Added a zero-token-drop guarantee (contrasted with top-K MoE token dropping).
7. Proposed the AdamTR optimizer: theoretical analysis of gradient variance asymmetry in MoE ($E\times$ noisier per-expert gradients under AdamW), with per-expert LR scaling and gradient normalization (a rough sketch follows this list).
8. New t-SNE visualizations (2D, 3D, interactive) from the 384M model showing expert specialization at scale.
9. Updated the architecture diagram to 187M (corrected from 176M).
10. All previous revision items maintained (Mixtral baseline, No-Mu ablation, theorem fixes, PiD removal).
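Item 7's variance argument can be made concrete. The following is a rough sketch of the per-expert learning-rate scaling idea behind AdamTR, under the assumption that each of $n$ experts sees about $1/n$ of the batch tokens; the constructor `build_adamtr` and the exact $1/\sqrt{n}$ factor are illustrative, not the paper's definitive update rule.

```python
import torch

def build_adamtr(shared_params, expert_param_lists, base_lr=3e-4):
    """Illustrative AdamTR-style optimizer via per-group learning rates.

    Each expert's gradient is estimated from ~1/n of the batch tokens,
    so its mini-batch variance is ~n x that of the shared parameters'
    gradient. Scaling the expert LR by 1/sqrt(n) roughly equalizes the
    effective update noise across dense and routed parameters.
    """
    n = max(len(expert_param_lists), 1)
    groups = [{"params": shared_params, "lr": base_lr}]
    for params in expert_param_lists:
        groups.append({"params": params, "lr": base_lr / n ** 0.5})
    return torch.optim.AdamW(groups)
```

The change note mentions both per-expert LR scaling and gradient normalization; the sketch shows only the former, since normalizing per-expert gradients before a standard AdamW step would target the same variance asymmetry.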
Assigned Action Editor: ~Tal_Schuster1
Submission Number: 7327