Keywords: Efficient Inference, Mixture of Experts, Model Compression, Conditional Computation, Language Modeling, Sparse Modeling, Dynamic Neural Networks, Representation Learning
Abstract: Transformer language models typically use fixed, dense embeddings where all dimensions are equally active regardless of context, leading to parameter inefficiency. We introduce a Context-Aware Embedding Routing framework with three instantiations: Selection from a larger dimensional pool, Remixing via Mixture-of-Experts (MoE) to permute a compact base, and Direct Generative Routing, which synthesizes high-fidelity embeddings from a low-dimensional semantic seed. Both routing-based approaches use context-aware soft attention mechanisms to adapt representations dynamically. Surprisingly, our experiments reveal that while all methods outperform dense baselines, the generative approach is superior at low dimensions. Activating only 64 dimensions via direct synthesis achieves a perplexity of 61.09 on WikiText-103 (vs. 92.35 for the 512-dim baseline), effectively performing "semantic super-resolution" on input tokens. This method achieves these gains with 87.5% fewer parameters and a 6.87x inference speedup, demonstrating that dynamic generation provides a better inductive bias for resource-constrained modeling than static storage or sparse selection.
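The "direct generative routing" idea from the abstract (synthesizing a full-width embedding from a low-dimensional semantic seed, optionally modulated by context) might be sketched as below. This is purely illustrative: the generator architecture, the weight shapes, and the sigmoid gating form are assumptions for exposition, not the authors' actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

SEED_DIM, FULL_DIM, HIDDEN = 64, 512, 256  # 64-dim seed -> 512-dim embedding, per the abstract

# Hypothetical generator weights (learned in a real system; random here).
W1 = rng.standard_normal((SEED_DIM, HIDDEN)) * 0.02
W2 = rng.standard_normal((HIDDEN, FULL_DIM)) * 0.02

def generate_embedding(seed_vec, context_vec=None):
    """Synthesize a full-width embedding from a low-dimensional semantic seed.

    If a context vector is supplied, it softly gates the seed before synthesis,
    loosely mimicking context-aware routing (the gating form is an assumption).
    """
    if context_vec is not None:
        gate = 1.0 / (1.0 + np.exp(-context_vec[:SEED_DIM]))  # sigmoid gate
        seed_vec = seed_vec * gate
    hidden = np.maximum(0.0, seed_vec @ W1)  # ReLU hidden layer
    return hidden @ W2

seed = rng.standard_normal(SEED_DIM)
emb = generate_embedding(seed)
print(emb.shape)  # (512,)
```

The point of the sketch is the parameter accounting: storing only 64-dim seeds plus a shared generator replaces a dense vocabulary-sized 512-dim table, which is where the claimed parameter savings would come from.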
Paper Type: Long
Research Area: LLM Efficiency
Research Area Keywords: Efficient/Low-Resource Methods for NLP, LLM Efficiency, parameter-efficient-training, sparse models, word embeddings, representation learning, polysemy, interpretability
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Approaches to low-compute settings / efficiency, Theory
Languages Studied: English
Submission Number: 647