When Precision Meets Position: BFloat16 Breaks Down RoPE in Long-Context Training

Published: 09 Feb 2025, Last Modified: 09 Feb 2025 · Accepted by TMLR · License: CC BY 4.0
Abstract: Extending context window sizes allows large language models (LLMs) to process longer sequences and handle more complex tasks. Rotary Positional Embedding (RoPE) has become the de facto standard due to its relative positional encoding properties that benefit long-context training. However, we observe that using RoPE with the BFloat16 format results in numerical issues, causing it to deviate from its intended relative positional encoding, especially in long-context scenarios. This issue arises from BFloat16's limited precision and accumulates as context length increases, with the first token contributing significantly to this problem. Despite its limitations, BFloat16 remains desirable for its computational efficiency, particularly given the substantial memory overhead required to extend the context window. To improve long-context training under BFloat16, we develop AnchorAttention, a plug-and-play attention method that enhances long-context capabilities and speeds up training. AnchorAttention reduces unnecessary attention computations, maintains semantic coherence, and boosts computational efficiency by treating the first token as a shared anchor with a consistent position ID, making it visible to all documents within the training context. Experiments on three types of LLMs demonstrate that AnchorAttention significantly improves long-context performance and reduces training time by over 50% compared to standard full attention mechanisms, while preserving the original LLM's capabilities on general tasks.
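As a rough illustration of the shared-anchor idea described in the abstract, the sketch below builds an attention mask for a packed training sequence in which tokens attend causally within their own document while the first token acts as an anchor visible to every document. The function name anchor_attention_mask, the doc_ids layout, and the position-ID line are illustrative assumptions, not the authors' actual API; see the linked repository for the real implementation.

import torch

# Illustrative sketch only; not the AnchorAttention repository implementation.
def anchor_attention_mask(doc_ids: torch.Tensor) -> torch.Tensor:
    # doc_ids: (seq_len,) tensor mapping each packed token to its document index;
    # index 0 of the sequence is assumed to hold the shared anchor token.
    seq_len = doc_ids.size(0)
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    same_doc = doc_ids.unsqueeze(0) == doc_ids.unsqueeze(1)  # query/key belong to the same document
    mask = causal & same_doc        # causal attention restricted to within each document
    mask[:, 0] = True               # the anchor token remains visible to all documents
    return mask

# Example: three documents packed into one training sequence, anchor first.
doc_ids = torch.tensor([0, 0, 0, 1, 1, 1, 2, 2])
mask = anchor_attention_mask(doc_ids)         # (8, 8) boolean mask, True = may attend
position_ids = torch.arange(doc_ids.size(0))  # the anchor keeps a consistent position ID (0)

In a real training loop a mask like this (or an equivalent block-sparse / variable-length attention kernel) would replace the standard full causal mask, so cross-document attention is skipped entirely while every document still shares the same anchor token.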
Submission Length: Long submission (more than 12 pages of main content)
Code: https://github.com/haonan3/AnchorContext
Assigned Action Editor: ~Li_Dong1
Submission Number: 3712