Language Modeling with Learned Meta-Tokens

Published: 10 Jun 2025, Last Modified: 28 Jun 2025 | LCFM 2025 | CC BY 4.0
Keywords: meta-tokens, language models, pre-training, positional encoding
TL;DR: We pre-train language models with inserted meta-tokens, demonstrating strong performance and length generalization on synthetic tasks for long-context modeling. We explain these results in terms of sharpened positional encoding and implicit context compression.
Abstract: While modern Transformer-based language models (LMs) have achieved major success in multi-task generalization, they often struggle to capture long-range dependencies within their context window. This work introduces a novel approach using meta-tokens, special tokens injected during pre-training, along with a dedicated meta-attention mechanism that guides LMs to use these tokens. We pre-train a language model with a modified GPT-2 architecture equipped with meta-attention on fewer than 100B tokens, achieving strong performance on a suite of synthetic tasks. We suggest that these gains arise from the meta-tokens "sharpening" the positional encoding: they operate as content-based landmarks, implicitly compressing the preceding context and "caching" it in the meta-token. At inference time, the meta-token points to relevant context, facilitating length generalization. Our findings suggest that pre-training LMs with meta-tokens offers a simple, data-efficient method to enhance long-context language modeling performance, while offering new insights into their length-generalization behavior.
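The abstract describes injecting meta-tokens during pre-training and attending to them through a dedicated meta-attention mechanism, but gives no implementation details. The Python sketch below is only one plausible reading of that setup: the names META_ID, INTERVAL, insert_meta_tokens, and meta_attention_mask are hypothetical, and both the fixed token spacing and the masking rule (each meta-token summarizing only its preceding chunk) are assumptions, not the authors' design.

```python
# Minimal sketch, assuming fixed-interval meta-token injection and a
# chunk-local "meta-attention" mask. Not the authors' implementation.
import torch

META_ID = 50257   # hypothetical id appended to a GPT-2-sized vocabulary
INTERVAL = 64     # assumed spacing between injected meta-tokens

def insert_meta_tokens(ids: torch.Tensor) -> torch.Tensor:
    """Insert one meta-token after every INTERVAL ordinary tokens."""
    chunks = []
    for start in range(0, ids.size(0), INTERVAL):
        chunks.append(ids[start:start + INTERVAL])
        chunks.append(torch.tensor([META_ID], dtype=ids.dtype))
    return torch.cat(chunks)

def meta_attention_mask(ids: torch.Tensor) -> torch.Tensor:
    """One possible meta-attention pattern: ordinary tokens attend causally,
    while each meta-token attends only to its own preceding chunk, acting as
    a compressed summary ("cache") of that span."""
    n = ids.size(0)
    mask = torch.tril(torch.ones(n, n, dtype=torch.bool))  # causal base mask
    meta_positions = ids.eq(META_ID).nonzero(as_tuple=True)[0]
    prev = -1
    for p in meta_positions.tolist():
        mask[p, : prev + 1] = False        # hide everything before its chunk
        mask[p, prev + 1 : p + 1] = True   # keep its own chunk and itself
        prev = p
    return mask

# Usage: augment a token sequence and build the attention mask for it.
tokens = torch.arange(200)                 # stand-in for real token ids
with_meta = insert_meta_tokens(tokens)
mask = meta_attention_mask(with_meta)      # (seq, seq) boolean mask
```

Under this reading, the mask is what lets a meta-token act as a content-based landmark: it can only encode the chunk it closes, so later tokens that attend to it retrieve a compressed view of that span.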
Submission Number: 41