$\texttt{LIME}$: Making LLM Data More Efficient with Linguistic Metadata Embeddings

ICLR 2026 Conference Submission 19080 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: Language Models, LLM, Metadata Embeddings, Pre-Training, Tokenization, Data Efficiency, Linguistic
Abstract: Pre-training decoder-only language models relies on vast amounts of high-quality data, yet the availability of such data is increasingly reaching its limits. While metadata is commonly used to create and curate these datasets, its potential as a direct training signal remains under-explored. We challenge this status quo and propose $\texttt{LIME}$ ($\textbf{Li}$nguistic $\textbf{M}$etadata $\textbf{E}$mbeddings), a method that enriches token embeddings with metadata capturing syntax, semantics, and contextual properties. $\texttt{LIME}$ substantially improves pre-training efficiency. Specifically, it adapts up to 56% faster to the training data distribution, while introducing only 0.01% additional parameters with negligible compute overhead. Beyond efficiency, $\texttt{LIME}$ improves tokenization, leading to markedly stronger language modeling capabilities and generative task performance. These benefits persist across model scales (500M to 2B parameters). In addition, we develop a variant with shifted metadata, $\texttt{LIME}^{+1}$, that can guide token generation. Given prior metadata for the next token, $\texttt{LIME}^{+1}$ improves reasoning performance by up to 38% and arithmetic accuracy by up to 35%.
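
A minimal sketch of the core idea described in the abstract: enriching each token embedding with a small metadata embedding. This is not the authors' implementation; the metadata type (here a single categorical tag id per token), the combination by summation, and names such as `num_meta_tags` and `TokenPlusMetadataEmbedding` are illustrative assumptions.

```python
import torch
import torch.nn as nn


class TokenPlusMetadataEmbedding(nn.Module):
    """Token embedding enriched with a per-token linguistic metadata embedding (sketch)."""

    def __init__(self, vocab_size: int, num_meta_tags: int, d_model: int):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        # The metadata table is tiny relative to the token table
        # (num_meta_tags << vocab_size), consistent with the abstract's
        # claim of only ~0.01% additional parameters.
        self.meta_emb = nn.Embedding(num_meta_tags, d_model)

    def forward(self, token_ids: torch.Tensor, meta_ids: torch.Tensor) -> torch.Tensor:
        # Enrich each token embedding with its metadata embedding.
        # Summation is one simple combination choice; the paper may combine them differently.
        return self.tok_emb(token_ids) + self.meta_emb(meta_ids)


# Example usage with hypothetical sizes: a batch of token ids and their metadata tag ids
# (e.g., POS-like labels) of identical shape.
emb = TokenPlusMetadataEmbedding(vocab_size=32000, num_meta_tags=32, d_model=512)
token_ids = torch.randint(0, 32000, (2, 16))
meta_ids = torch.randint(0, 32, (2, 16))
hidden = emb(token_ids, meta_ids)  # shape: (2, 16, 512)

# For the shifted variant ($\texttt{LIME}^{+1}$), one plausible reading is that each position
# receives the metadata of the *next* token (e.g., meta_ids shifted left by one),
# so the metadata acts as a hint that guides what the model generates next.
```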
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 19080