Abstract: Large language models (LLMs) are now expected to process inputs of unprecedented length, with current context limits extending to several million tokens. To meet these long-sequence demands, fine-tuning frameworks must also support post-training on extended sequences. Building on the LLaMA-Factory framework, we implemented multiple sequence parallelism methods (DeepSpeed-Ulysses and Ring-Attention), providing practical support for sequence parallelism over long sequences. We further extended DeepSpeed-Ulysses with dummy heads to handle cases where the number of attention heads is not divisible by the sequence parallel size. We also conducted an in-depth analysis of the practical issues and potential errors that arise when applying sequence parallelism to post-training. Finally, we experimentally validated the correctness of our sequence parallelism implementation, demonstrated the efficiency of our Dummy-Head Ulysses, and compared different sequence parallel strategies in terms of maximum sequence length and runtime efficiency. Our code is available at https://anonymous.4open.science/r/SP-LLaMA-Factory-B8B1.
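The abstract's dummy-head extension can be illustrated with a minimal sketch (not the authors' code): pad the head dimension with zero-valued dummy heads so the head count becomes divisible by the sequence-parallel size that Ulysses-style all-to-all requires, then drop the padding afterwards. The helper names (pad_dummy_heads, drop_dummy_heads) and the sp_size parameter are illustrative assumptions, not the paper's API.

```python
# Sketch of the dummy-head idea, assuming (batch, seq_len, num_heads, head_dim) layout.
import torch


def pad_dummy_heads(x: torch.Tensor, sp_size: int) -> tuple[torch.Tensor, int]:
    """Append zero-valued heads so the head count is divisible by sp_size."""
    num_heads = x.shape[2]
    pad = (-num_heads) % sp_size  # heads to add so (num_heads + pad) % sp_size == 0
    if pad:
        zeros = x.new_zeros(x.shape[0], x.shape[1], pad, x.shape[3])
        x = torch.cat([x, zeros], dim=2)
    return x, pad


def drop_dummy_heads(x: torch.Tensor, pad: int) -> torch.Tensor:
    """Remove the dummy heads appended by pad_dummy_heads."""
    return x[:, :, : x.shape[2] - pad, :] if pad else x


# Usage: 14 attention heads with sequence-parallel size 4 are padded to 16 heads,
# processed (e.g., by the Ulysses all-to-all and attention), then restored.
q = torch.randn(2, 128, 14, 64)
q_padded, n_pad = pad_dummy_heads(q, sp_size=4)
assert q_padded.shape[2] % 4 == 0
q_restored = drop_dummy_heads(q_padded, n_pad)
assert torch.equal(q_restored, q)
```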
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: LLM Efficiency, NLP in resource-constrained settings
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Publicly available software and/or pre-trained models
Languages Studied: common to all languages
Submission Number: 1686