Efficient Long Context Fine-tuning with Chunk Flow

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: Efficient Long Context Fine-tuning with Chunk Flow
Abstract: Long context fine-tuning of large language models (LLMs) involves training on datasets that are predominantly composed of short sequences and a small proportion of longer sequences. However, existing approaches overlook this long-tail distribution and employ training strategies designed specifically for long sequences. Moreover, these approaches fail to address the challenges posed by variable sequence lengths in distributed training, such as load imbalance in data parallelism and severe pipeline bubbles in pipeline parallelism. These issues lead to suboptimal training performance and poor GPU resource utilization. To tackle these problems, we propose a chunk-centric training method named ChunkFlow. ChunkFlow reorganizes input sequences into uniformly sized chunks by consolidating short sequences and splitting longer ones, yielding balanced, computationally efficient training inputs. Additionally, ChunkFlow incorporates a state-aware chunk scheduling mechanism that ensures peak memory usage during training is determined primarily by the chunk size rather than by the maximum sequence length in the dataset. Integrating this scheduling mechanism with existing pipeline scheduling algorithms further improves distributed training performance. Experimental results demonstrate that ChunkFlow can be up to 4.53x faster than Megatron-LM for long context fine-tuning of LLMs. Furthermore, we believe ChunkFlow serves as an effective solution for a broader range of scenarios in which datasets contain variable-length sequences, such as long context continual pre-training.
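To make the chunk reorganization concrete, here is a minimal Python sketch (our illustration, not the authors' code): sequences longer than a fixed chunk_size are split into chunk_size pieces, and short sequences and leftover pieces are greedily packed together so that every chunk holds at most chunk_size tokens. The function name reorganize and the greedy packing policy are our assumptions.

```python
# Minimal sketch of ChunkFlow-style chunk reorganization (an assumption of
# how it might work, not the authors' implementation). Every training chunk
# ends up with at most chunk_size tokens, regardless of sequence lengths.

def reorganize(sequences, chunk_size):
    """Pack short sequences together and split long ones into fixed-size chunks.

    Each returned chunk is a list of (seq_id, tokens) pieces; pieces sharing a
    seq_id belong to the same original sequence and must be scheduled in order
    so that its state can be carried across chunks.
    """
    chunks, current, current_len = [], [], 0
    for seq_id, tokens in enumerate(sequences):
        # Split any sequence longer than chunk_size into chunk_size pieces.
        for start in range(0, len(tokens), chunk_size):
            piece = tokens[start:start + chunk_size]
            # Flush the current chunk if this piece would overflow it.
            if current_len + len(piece) > chunk_size:
                chunks.append(current)
                current, current_len = [], 0
            current.append((seq_id, piece))
            current_len += len(piece)
    if current:
        chunks.append(current)
    return chunks

# Example: a long-tail mix of three short sequences and one long one.
seqs = [list(range(3)), list(range(4)), list(range(20)), list(range(2))]
for i, chunk in enumerate(reorganize(seqs, chunk_size=8)):
    print(i, [(sid, len(piece)) for sid, piece in chunk])
# 0 [(0, 3), (1, 4)]   <- two short sequences packed together
# 1 [(2, 8)]           <- long sequence split across chunks 1-3
# 2 [(2, 8)]
# 3 [(2, 4), (3, 2)]   <- its tail packed with another short sequence
```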
Lay Summary: Long context fine-tuning is essential for extending the ability of large language models (LLMs) to handle long texts. This process involves training on carefully gathered datasets that are predominantly composed of short sequences with a small proportion of longer ones (e.g., 99% of texts are short and 1% are extremely long). However, existing training methods overlook this characteristic and employ strategies designed for long sequences, resulting in suboptimal training efficiency. To solve this, we propose ChunkFlow, a chunk-centric training method designed for long context fine-tuning scenarios. ChunkFlow reorganizes input sequences into uniformly sized chunks by combining short texts and splitting long ones, achieving balanced workloads and high computational efficiency. Additionally, ChunkFlow employs a state-aware chunk scheduling mechanism that keeps peak memory usage controllable: it is determined primarily by the pre-defined chunk size rather than by the longest sequence in the training dataset. Experiments show that ChunkFlow speeds up long context fine-tuning by up to 4.53x compared with the state-of-the-art system Megatron-LM. By significantly accelerating long context fine-tuning, ChunkFlow benefits a wide range of downstream applications, from code generation to complex question answering.
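The memory claim can be illustrated with a toy schedule (again our reading of the mechanism, not the paper's implementation): chunks of a split long sequence run in order, each step materializes activations for at most chunk_size tokens, and a compact per-sequence carried state (e.g., cached keys and values) persists only until the sequence's last chunk completes. The schedule function and its bookkeeping below are hypothetical.

```python
# Toy state-aware schedule (our assumption, not the authors' scheduler).
# It reuses the chunk layout from the reorganize sketch above, with pieces
# given as (seq_id, num_tokens). Activation memory per step is bounded by
# the chunk size; only a small carried state outlives a step.

def schedule(chunks, chunks_per_seq):
    carried, seen = {}, {}                    # per-sequence carried state / progress
    for step, chunk in enumerate(chunks):
        live = sum(n for _, n in chunk)       # activation tokens this step <= chunk_size
        for seq_id, n in chunk:
            carried[seq_id] = carried.get(seq_id, 0) + n
            seen[seq_id] = seen.get(seq_id, 0) + 1
            if seen[seq_id] == chunks_per_seq[seq_id]:
                del carried[seq_id]           # last chunk done: free the state
        print(f"step {step}: {live} activation tokens, carried state for seqs {sorted(carried)}")

# Chunks produced above for chunk_size=8; sequence 2 (length 20) spans 3 chunks.
chunks = [[(0, 3), (1, 4)], [(2, 8)], [(2, 8)], [(2, 4), (3, 2)]]
schedule(chunks, chunks_per_seq={0: 1, 1: 1, 2: 3, 3: 1})
# step 0: 7 activation tokens, carried state for seqs []
# step 1: 8 activation tokens, carried state for seqs [2]
# step 2: 8 activation tokens, carried state for seqs [2]
# step 3: 6 activation tokens, carried state for seqs []
```

Note how no step ever touches more than 8 activation tokens even though the longest sequence has 20, which is the sense in which peak memory tracks the chunk size rather than the maximum sequence length.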
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Primary Area: Deep Learning->Large Language Models
Keywords: Long context fine-tuning, Large language model
Submission Number: 2130