Keywords: DoE, knowledge block, expert decoupling
TL;DR: Our Decoupling of Experts (DoE) architecture uses a two-stage (LDA -> VAE) process to create dynamic 'Knowledge Block' experts from the attention Key/Value matrices. It replaces MoE routers and softmax gating with an attention-based gate (AGC), achieving layer-wise specialization and efficient scaling.
Abstract: Current large language models (LLMs), particularly Mixture-of-Experts (MoE) variants, face challenges in achieving efficient, structured, and interpretable scaling. We introduce the Decoupling of Experts (DoE) architecture, a novel framework that addresses these limitations by grounding computation in a hierarchically organized and dynamically updated knowledge space. Our methodology follows a two-stage lifecycle: first, Latent Dirichlet Allocation (LDA) builds a semantic topic foundation from the training corpus; this knowledge is then integrated into the main LLM, where it is dynamically refined. Critically, we discard traditional, static MoE experts. Instead, each expert is a dynamic \textbf{Knowledge Block} synthesized on the fly by reusing the Key and Value matrices from the attention computation. We replace the standard load balancer and softmax gating with an \textbf{Attention Gating Control (AGC)} module that employs a VAE-based router with ReLU activation for expert composition. The entire process is optimized with a composite loss function that balances next-token prediction with a KL-divergence-based expert loss. Our analysis reveals that this architecture induces a marked \textbf{heterogeneous specialization} across layers: some layers differentiate into "science" and "humanities" domains, while others converge on general functions. This demonstrates a learned, hierarchical division of labor and paves the way for a new, more efficient scaling dimension based on the number of structured experts.
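The sketch below is a minimal, hypothetical reading of the mechanisms named in the abstract, not the authors' released code: a "Knowledge Block" layer that reuses Key/Value projections, an Attention Gating Control (AGC) router built as a small VAE with ReLU gates in place of softmax and a load balancer, and a composite loss combining next-token cross-entropy with a KL-divergence expert term. The LDA pre-stage is omitted, and every class, function, and hyperparameter name (e.g. `AGCKnowledgeBlockLayer`, `n_blocks`, `beta`) is an illustrative assumption.

```python
# Illustrative sketch only; all names and shapes are assumptions, not the paper's API.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AGCKnowledgeBlockLayer(nn.Module):
    """Hypothetical layer: K/V-derived Knowledge Blocks composed by a VAE-based AGC router."""

    def __init__(self, d_model: int, n_blocks: int, d_latent: int = 32):
        super().__init__()
        assert d_model % n_blocks == 0, "assumed: d_model divisible into n_blocks chunks"
        self.n_blocks = n_blocks
        # Projections standing in for the attention Key/Value matrices being reused.
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # VAE-style router: encoder -> (mu, logvar) -> gate head.
        self.enc = nn.Linear(d_model, d_latent)
        self.mu = nn.Linear(d_latent, d_latent)
        self.logvar = nn.Linear(d_latent, d_latent)
        self.gate_head = nn.Linear(d_latent, n_blocks)

    def forward(self, h: torch.Tensor):
        # h: (batch, seq, d_model)
        B, T, D = h.shape
        K = self.k_proj(h)
        V = self.v_proj(h)

        # Synthesize Knowledge Blocks on the fly: split the K/V interaction
        # features into n_blocks chunks, one chunk per block.
        kv = (K * V).view(B, T, self.n_blocks, D // self.n_blocks)

        # AGC: VAE router with reparameterization and ReLU gates
        # (no softmax normalization, no auxiliary load-balancing loss).
        e = torch.tanh(self.enc(h))
        mu, logvar = self.mu(e), self.logvar(e)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        gates = F.relu(self.gate_head(z))            # (B, T, n_blocks), sparse mixing weights

        # Compose the output as a gated, residual combination of Knowledge Blocks.
        out = h + (gates.unsqueeze(-1) * kv).reshape(B, T, D)

        # KL term of the router, contributed to the composite "expert loss".
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return out, kl


def composite_loss(logits, targets, kl_terms, beta: float = 0.01):
    """Next-token cross-entropy plus a KL-divergence expert loss (weight beta is assumed)."""
    ce = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    return ce + beta * torch.stack(kl_terms).mean()
```

One design point this sketch tries to mirror: because the gates are ReLU outputs rather than a softmax distribution, any number of Knowledge Blocks (including none) can be active for a token, which is one plausible way the layer-wise specialization described in the abstract could emerge without an explicit load-balancing objective.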
Primary Area: foundation or frontier models, including LLMs
Submission Number: 13711