SimReg: Achieving Faster Convergence and Better Generalization in LLM Pretraining via Embedding Similarity Regularization
Keywords: Embedding similarity, cross entropy, pretraining
TL;DR: We introduce embedding similarity supervision to assist the cross-entropy loss and accelerate large-scale pre-training of LLMs.
Abstract: Pretraining large language models (LLMs) with next-token prediction has led to remarkable advances, yet the context-dependent nature of token embeddings in such models results in high intra-class variance and inter-class similarity, hindering the efficiency of representation learning. While similarity-based regularization has demonstrated benefits in supervised fine-tuning and classification tasks, its application and efficacy in large-scale LLM pretraining remain underexplored. In this work, we propose SimReg, an embedding similarity regularization loss that explicitly encourages token representations with the same ground-truth label within each sequence to be more similar, while enforcing separation from different-label tokens via a contrastive loss. Our theoretical analysis elucidates how SimReg improves both classification margins and generalization in the pretraining stage. Extensive experiments across dense and Mixture-of-Experts (MoE) architectures demonstrate that SimReg consistently accelerates training convergence by over 30\% and improves average zero-shot downstream performance by over 1\% on standard benchmarks. Further ablations and analysis provide practical recommendations for hyperparameter selection and loss application, offering constructive insights for efficient pretraining of LLMs.
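The abstract's core idea, pulling together token embeddings that share a ground-truth label while pushing apart different-label ones, can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual loss: the function name `simreg_loss`, the hinge-with-margin form of the push-apart term, and the `margin` parameter are all assumptions, since the submission text does not specify the exact formulation.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors (pure Python).
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u)) or 1e-12
    nv = math.sqrt(sum(b * b for b in v)) or 1e-12
    return dot / (nu * nv)

def simreg_loss(embeddings, labels, margin=0.5):
    """Hypothetical SimReg-style regularizer over one sequence.

    Same-label token pairs are pulled toward cosine similarity 1;
    different-label pairs incur a hinge penalty when their similarity
    exceeds `margin`. Both terms are averaged over their pair counts.
    """
    pull, push, n_pos, n_neg = 0.0, 0.0, 0, 0
    n = len(embeddings)
    for i in range(n):
        for j in range(i + 1, n):
            s = cosine(embeddings[i], embeddings[j])
            if labels[i] == labels[j]:
                pull += 1.0 - s               # same label: encourage similarity
                n_pos += 1
            else:
                push += max(0.0, s - margin)  # different label: push apart
                n_neg += 1
    loss = 0.0
    if n_pos:
        loss += pull / n_pos
    if n_neg:
        loss += push / n_neg
    return loss
```

In pretraining, such a term would be added to the standard cross-entropy objective with a weighting coefficient, with "labels" being the next-token targets within each sequence; identical embeddings with identical labels yield zero regularization loss, while identical embeddings with different labels are penalized.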
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 23647