Abstract: As deep learning advances rapidly, model training increasingly relies on parallel methods, and numerous cluster configurations are possible. However, current approaches to parallel training focus on data centers, overlooking the financial constraints faced by most researchers. To attain the best performance within a cost limitation, we introduce a throughput-cost metric that accurately characterizes a cluster's cost-effectiveness. Based on this metric, we design a cost-effective cluster built around the RTX 3090 with NVLink. Experimental results demonstrate that our cluster achieves remarkable cost-effectiveness across various distributed model training schemes.