GPURank: A Cloud GPU Instance Recommendation System

Shravika Mittal; Kanak Mahadik; Ryan A. Rossi; Sungchul Kim; Handong Zhao

Published: 01 Jan 2024, Last Modified: 20 May 2025. Venue: IEEE Big Data 2024. License: CC BY-SA 4.0

Abstract: With the advent of cloud platforms that offer GPU-as-a-Service (GPUaaS), such as Amazon EC2 and Microsoft Azure, researchers increasingly rely on virtual GPU instances to train deep learning (DL) workloads. These GPU instances vary in configuration attributes, including but not limited to the number of GPUs, the number of vCPUs, and per-hour usage cost. Identifying the appropriate GPU instance for training a DL workload is difficult for two reasons: the selection space of GPU instances offered by cloud platforms is huge, and training performance and computational needs vary across DL workloads. In this paper, we propose GPURank, a GPU instance recommendation system that provides a recommended list of GPU instances to choose from for DL workloads. To make this choice, GPURank predicts and leverages two metrics: epoch training cost and average GPU utilization. We curated a new benchmark dataset by profiling diverse DL workloads to train the regression models in GPURank’s prediction framework. We demonstrate that GPURank beats baselines in two pertinent problem settings, (1) unseen workloads and (2) unseen GPU instances, with 25.89% and 20.10% higher average ranking performance, respectively.
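
The abstract does not say how the two predicted metrics are combined into a recommendation list, so the following is only a minimal sketch of one plausible ordering, not GPURank's actual method. It assumes the predictions are already available, sorts candidates by ascending predicted epoch training cost, and breaks ties by higher predicted average GPU utilization. The `Candidate` class, the `rank_candidates` function, the `min_utilization` filter, the instance labels, and all numbers are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One GPU instance with its two predicted metrics (all values illustrative)."""
    name: str               # hypothetical instance label, not a real instance type
    epoch_cost_usd: float   # predicted cost of one training epoch
    gpu_utilization: float  # predicted average GPU utilization, in [0, 1]


def rank_candidates(candidates, min_utilization=0.0):
    """Return candidates ordered from most to least preferred.

    Assumed rule (not taken from the paper): drop candidates whose predicted
    utilization is below min_utilization, then sort by ascending predicted
    epoch cost, breaking ties by higher predicted utilization.
    """
    eligible = [c for c in candidates if c.gpu_utilization >= min_utilization]
    return sorted(eligible, key=lambda c: (c.epoch_cost_usd, -c.gpu_utilization))


if __name__ == "__main__":
    # Made-up predictions for three hypothetical instances.
    predictions = [
        Candidate("instance-a", epoch_cost_usd=0.42, gpu_utilization=0.81),
        Candidate("instance-b", epoch_cost_usd=0.30, gpu_utilization=0.35),
        Candidate("instance-c", epoch_cost_usd=0.30, gpu_utilization=0.92),
    ]
    for position, cand in enumerate(rank_candidates(predictions, min_utilization=0.5), 1):
        print(position, cand.name, cand.epoch_cost_usd, cand.gpu_utilization)
```

Treating cost as the primary key and utilization as a tie-breaker is a design choice made here for simplicity; a weighted score or a Pareto filter over the two metrics would be equally reasonable under the same assumptions.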
