Abstract: The training process of LLMs has high memory demands and high economic costs, making it challenging for many organizations to adopt and scale effectively. In this paper, we train models using block coordinate descent (BCD) on inexpensive RTX 4090 clusters, combined with engineering improvements, to train LLMs at lower economic cost and with lower memory demands. In BCD training, only a subset of parameters is updated at each step, which significantly reduces memory requirements. Through experiments, we show that: 1. for a wide range of models and datasets, BCD trains models to the same level of accuracy as traditional methods; 2. on average, BCD outperforms OffLoad by 42.0\% in training time with the same computational resources, and matches distributed training speed using only half the resources; 3. on average, BCD reduces training cost by more than 53.6\% compared to traditional methods on the 4090 cluster, and by more than 74.9\% on the A100 cluster.
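The core idea stated in the abstract, updating only one block of parameters at a time so that gradients and optimizer state are needed only for that block, can be illustrated with a minimal PyTorch sketch. The block partition (one block per parameterized module), the update schedule, and the hyperparameters below are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of block coordinate descent (BCD) training in PyTorch.
# Assumption: each parameterized child module is treated as one block.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
loss_fn = nn.MSELoss()

# Partition parameters into coordinate blocks (one per parameterized module).
blocks = [list(m.parameters()) for m in model.children()
          if sum(p.numel() for p in m.parameters()) > 0]

x, y = torch.randn(32, 16), torch.randn(32, 1)

for sweep in range(3):                      # sweep over all blocks
    for active in blocks:                   # activate one block at a time
        # Freeze everything, then unfreeze only the active block, so
        # gradients and optimizer state exist for that block alone --
        # this is the source of the memory savings described above.
        for p in model.parameters():
            p.requires_grad_(False)
        for p in active:
            p.requires_grad_(True)
        opt = torch.optim.AdamW(active, lr=1e-3)
        for _ in range(10):                 # a few inner steps per block
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            opt.step()
```

In a full-scale implementation the per-block optimizer state would be kept or offloaded across block switches rather than re-created each time; the sketch only shows the subset-update mechanism.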
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: low cost
Contribution Types: Approaches to low-resource settings, Approaches low compute settings-efficiency
Languages Studied: English
Submission Number: 324