Keywords: embedding layer scalability, contextualized token embeddings, scaling with fixed accelerator usage
TL;DR: We introduce contextualized n-gram embeddings to extend input embedding layers, improving performance while maintaining fixed accelerator usage during inference.
Abstract: We propose SCONE (**S**calable, **C**ontextualized, **O**ffloaded, **N**-gram **E**mbedding), a new method for extending input embedding layers to enhance language model performance. To avoid increased decoding costs, SCONE retains the original vocabulary while introducing embeddings for a set of frequent $n$-grams. These embeddings provide a contextualized representation for each input token and are learned with a separate model during training. After training, the embeddings are precomputed and stored in off-accelerator memory; during inference, querying them has minimal impact on latency due to the low complexity of embedding lookups. SCONE enables two new scaling strategies: increasing the number of $n$-gram embeddings and scaling the model used to learn them, both while maintaining fixed accelerator usage during inference (in terms of FLOPS and memory). We show that scaling both aspects enables a model with 1B accelerator-resident parameters to outperform a 1.9B-parameter baseline across diverse corpora, while using only about half the FLOPS and accelerator memory during inference.
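To make the inference-time path concrete, here is a minimal sketch of the lookup described in the abstract: a standard token embedding combined with a precomputed $n$-gram embedding table that would normally reside in off-accelerator memory. The class name, shapes, matching scheme, and the additive combination are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch (assumptions, not the SCONE codebase): a token embedding plus a
# lookup into a precomputed n-gram table standing in for the off-accelerator store.
import torch
import torch.nn as nn

class SconeLikeEmbedding(nn.Module):
    def __init__(self, vocab_size: int, num_ngrams: int, dim: int):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, dim)       # accelerator-resident
        # In SCONE these vectors are produced by a separate model during training,
        # then frozen and served from off-accelerator memory; a plain tensor is a
        # stand-in here. Row 0 is reserved for "no frequent n-gram matched".
        self.ngram_table = torch.zeros(num_ngrams + 1, dim)

    def forward(self, token_ids: torch.Tensor, ngram_ids: torch.Tensor) -> torch.Tensor:
        # ngram_ids[b, t] indexes the cached n-gram ending at position t (0 if none),
        # so each token receives a context-dependent vector at lookup cost only.
        # Adding it to the token embedding is an assumed combination for illustration.
        return self.token_emb(token_ids) + self.ngram_table[ngram_ids]

emb = SconeLikeEmbedding(vocab_size=32_000, num_ngrams=1_000_000, dim=1024)
tokens = torch.randint(0, 32_000, (1, 8))
ngrams = torch.randint(0, 1_000_001, (1, 8))
print(emb(tokens, ngrams).shape)  # torch.Size([1, 8, 1024])
```

Because the table is only indexed (never multiplied through the model), growing the number of $n$-gram embeddings or the model that produces them leaves inference-time accelerator FLOPS and memory unchanged, which is the scaling property the abstract highlights.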
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 11133