Keywords: Efficient Inference, Mixture of Experts, Model Compression, Conditional Computation, Language Modeling, Sparse Modeling, Dynamic Neural Networks, Representation Learning
Abstract: Transformer language models typically use fixed, dense embeddings where all dimensions are equally active regardless of context, leading to parameter inefficiency. We introduce a Context-Aware Embedding Routing framework with three instantiations: Selection from a larger dimensional pool, Remixing via Mixture-of-Experts (MoE) to permute a compact base, and Direct Generative Routing, which synthesizes high-fidelity embeddings from a low-dimensional semantic seed. Both routing-based approaches use context-aware soft attention mechanisms to adapt representations dynamically. Surprisingly, our experiments reveal that while all methods outperform dense baselines, the generative approach is superior at low dimensions. Activating only 64 dimensions via direct synthesis achieves a perplexity of 61.09 on WikiText-103 (vs. 92.35 for the 512-dim baseline), effectively performing "semantic super-resolution" on input tokens. This method achieves these gains with 87.5% fewer parameters and a 6.87x inference speedup, demonstrating that dynamic generation provides a better inductive bias for resource-constrained modeling than static storage or sparse selection.
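The "direct generative routing" idea from the abstract (synthesizing a full-width embedding from a low-dimensional semantic seed, optionally modulated by context) might be sketched as below. This is purely illustrative: the generator architecture, the weight shapes, and the sigmoid gating form are assumptions for exposition, not the authors' actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

SEED_DIM, FULL_DIM, HIDDEN = 64, 512, 256  # 64-dim seed -> 512-dim embedding, per the abstract

# Hypothetical generator weights (learned in a real system; random here).
W1 = rng.standard_normal((SEED_DIM, HIDDEN)) * 0.02
W2 = rng.standard_normal((HIDDEN, FULL_DIM)) * 0.02

def generate_embedding(seed_vec, context_vec=None):
    """Synthesize a full-width embedding from a low-dimensional semantic seed.

    If a context vector is supplied, it softly gates the seed before synthesis,
    loosely mimicking context-aware routing (the gating form is an assumption).
    """
    if context_vec is not None:
        gate = 1.0 / (1.0 + np.exp(-context_vec[:SEED_DIM]))  # sigmoid gate
        seed_vec = seed_vec * gate
    hidden = np.maximum(0.0, seed_vec @ W1)  # ReLU hidden layer
    return hidden @ W2

seed = rng.standard_normal(SEED_DIM)
emb = generate_embedding(seed)
print(emb.shape)  # (512,)
```

The point of the sketch is the parameter accounting: storing only 64-dim seeds plus a shared generator replaces a dense vocabulary-sized 512-dim table, which is where the claimed parameter savings would come from.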
Paper Type: Long
Research Area: LLM Efficiency
Research Area Keywords: Efficient/Low-Resource Methods for NLP, LLM Efficiency, parameter-efficient-training, sparse models, word embeddings, representation learning, polysemy, interpretability
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Approaches to low-compute settings / efficiency, Theory
Languages Studied: English
Submission Number: 647