Stable Language Model Pre-training by Reducing Embedding Variability

Published: 2024 · Last Modified: 13 May 2025 · EMNLP 2024 · License: CC BY-SA 4.0
Abstract: Stable pre-training is essential for achieving better-performing language models. However, tracking pre-training stability directly is impractical due to high computational costs. We study Token Embedding Variability as a simple proxy for estimating pre-training stability. We theoretically and empirically demonstrate that Multi-head Low-Rank Attention offers a fundamental approach to reducing this instability. These findings are supported by empirical results on GPT-2 variants, which show improved stability and lower perplexities, even at deeper layer counts.
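The sketch below illustrates one plausible way a "Multi-head Low-Rank Attention" layer could be structured: each attention projection is factorized into a rank-r product instead of a full-rank matrix. This is an assumption-based illustration (class names, the rank hyperparameter, and the factorization scheme are hypothetical), not the paper's reference implementation.

```python
# Minimal sketch (assumed design, not the authors' code): multi-head attention
# whose Q/K/V/output projections are low-rank factorized as W ~= A @ B.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LowRankLinear(nn.Module):
    """Linear map factored as x -> (x @ A) @ B with rank r << d."""

    def __init__(self, d_in: int, d_out: int, rank: int):
        super().__init__()
        self.down = nn.Linear(d_in, rank, bias=False)  # A: d_in -> r
        self.up = nn.Linear(rank, d_out, bias=False)   # B: r -> d_out

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(x))


class MultiHeadLowRankAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int, rank: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.q_proj = LowRankLinear(d_model, d_model, rank)
        self.k_proj = LowRankLinear(d_model, d_model, rank)
        self.v_proj = LowRankLinear(d_model, d_model, rank)
        self.o_proj = LowRankLinear(d_model, d_model, rank)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        # Project and split into heads: (b, n_heads, t, d_head)
        q = self.q_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        # Causal scaled dot-product attention (PyTorch >= 2.0)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).contiguous().view(b, t, d)
        return self.o_proj(out)
```

Under this reading, the low-rank factorization constrains the spectral norm of each projection, which is one mechanism by which update-to-update variability of the token embeddings could be dampened during pre-training; the exact formulation used in the paper may differ.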