Abstract: The training process of LLMs has high memory demands and high economic costs, making it challenging for many organizations to adopt and scale effectively. In this paper, we train models using block coordinate descent (BCD) on inexpensive RTX 4090 clusters, combined with engineering improvements, to train LLMs at lower economic cost and with lower memory demands. In BCD training, only a subset of parameters is updated at each step, which significantly reduces memory requirements. Through experiments, we show that: 1. for a wide range of models and datasets, BCD trains models to the same level of accuracy as traditional methods; 2. on average, BCD outperforms OffLoad by 42.0\% in training time with the same computational resources, and matches distributed training speed using only half the resources; 3. on average, BCD reduces training cost by more than 53.6\% compared to traditional methods on the 4090 cluster, and by more than 74.9\% on the A100 cluster.
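The core idea stated in the abstract, updating only one block of parameters at a time so that gradients and optimizer state are needed only for that block, can be illustrated with a minimal PyTorch sketch. The block partition (one block per parameterized module), the update schedule, and the hyperparameters below are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of block coordinate descent (BCD) training in PyTorch.
# Assumption: each parameterized child module is treated as one block.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
loss_fn = nn.MSELoss()

# Partition parameters into coordinate blocks (one per parameterized module).
blocks = [list(m.parameters()) for m in model.children()
          if sum(p.numel() for p in m.parameters()) > 0]

x, y = torch.randn(32, 16), torch.randn(32, 1)

for sweep in range(3):                      # sweep over all blocks
    for active in blocks:                   # activate one block at a time
        # Freeze everything, then unfreeze only the active block, so
        # gradients and optimizer state exist for that block alone --
        # this is the source of the memory savings described above.
        for p in model.parameters():
            p.requires_grad_(False)
        for p in active:
            p.requires_grad_(True)
        opt = torch.optim.AdamW(active, lr=1e-3)
        for _ in range(10):                 # a few inner steps per block
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            opt.step()
```

In a full-scale implementation the per-block optimizer state would be kept or offloaded across block switches rather than re-created each time; the sketch only shows the subset-update mechanism.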
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: low cost
Contribution Types: Approaches to low-resource settings, Approaches low compute settings-efficiency
Languages Studied: English
Submission Number: 324