Scaling Embedding Layers in Language Models

22 Jan 2025 (modified: 18 Jun 2025) · Submitted to ICML 2025 · CC BY 4.0
TL;DR: We scale embedding layers with precomputed, offloaded n-gram embeddings, improving language model performance at fixed inference-time FLOPS.
Abstract: We propose SCONE (**S**calable, **C**ontextualized, **O**ffloaded, **N**-gram **E**mbedding), a method for extending input embedding layers to enhance language model performance as layer size scales. To avoid increased decoding costs, SCONE retains the original vocabulary while introducing embeddings for a set of frequent $n$-grams. These embeddings provide a contextualized representation for each input token and are learned with a separate model during training. During inference, they are precomputed and stored in off-accelerator memory with minimal impact on inference speed. SCONE enables two new scaling strategies: increasing the number of cached $n$-gram embeddings and scaling the model used to learn them, all while maintaining fixed inference-time FLOPS. We show that scaling both aspects allows SCONE to outperform a 1.9B-parameter baseline across diverse corpora, while using only half the inference-time FLOPS.
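To make the lookup mechanism concrete, here is a minimal sketch, assuming a simple longest-match lookup over a cached table. The additive combination of token and $n$-gram embeddings, the table contents, and all names below (`ngram_table`, `MAX_N`, `embed`) are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of SCONE-style inference (illustrative; not the
# authors' implementation). The original vocabulary is unchanged;
# frequent n-grams get extra, precomputed embeddings.
import torch

d_model = 8
vocab = {"the": 0, "cat": 1, "sat": 2}
tok_emb = torch.nn.Embedding(len(vocab), d_model)

# Hypothetical precomputed n-gram table. In SCONE these vectors are
# produced by a separate model during training, then cached in
# off-accelerator memory (e.g., host RAM), so inference only pays
# for a memory lookup rather than extra computation.
ngram_table = {
    (0, 1): torch.randn(d_model),  # "the cat"
    (1, 2): torch.randn(d_model),  # "cat sat"
}
MAX_N = 2  # longest cached n-gram

def embed(ids: list[int]) -> torch.Tensor:
    """Token embedding plus the longest cached n-gram ending at each position."""
    base = tok_emb(torch.tensor(ids))
    extra = torch.zeros_like(base)
    for i in range(len(ids)):
        # Try the longest n-gram ending at position i first.
        for n in range(min(MAX_N, i + 1), 1, -1):
            key = tuple(ids[i - n + 1 : i + 1])
            if key in ngram_table:
                extra[i] = ngram_table[key]  # lookup only, no added FLOPS
                break
    return base + extra

print(embed([0, 1, 2]).shape)  # torch.Size([3, 8])
```

Under these assumptions, the two scaling axes the abstract describes correspond to growing `ngram_table` and enlarging the model that produces its entries; neither changes the FLOPS of this inference path.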
Primary Area: General Machine Learning->Scalable Algorithms
Keywords: Embedding layer scalability, contextualized token embeddings, scaling with fixed inference budget
Submission Number: 7462