Keywords: Large language models, HPC, job scheduling, runtime prediction
Abstract: High-performance computing (HPC) systems rely on job schedulers like Slurm to
allocate compute resources to submitted workloads. Recently, machine learning
models have been used to predict job runtimes, which schedulers can use
to optimize utilization. However, many of these models struggle to effectively
encode string-type job features, typically relying on integer-based Label or One-hot
encoding methods. In this paper, we use Transformer-based large language models,
particularly Sentence-BERT (SBERT), to semantically encode job features for
regression-based job runtime prediction. Using a 90,000-record, 169-feature Slurm
dataset, we evaluate four SBERT variants and compare them against traditional
encodings using four regression models. Our results show that SBERT-based
encodings—especially using the all-MiniLM-L6-v2 model—substantially outperform
conventional methods, achieving an R² score of up to 0.88, 2.3× higher than
the traditionally used Label encoding. Moreover, we highlight practical trade-offs,
such as model memory size versus accuracy, to guide the selection of efficient
encoders for production HPC systems.
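The encoding comparison described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's pipeline: the feature strings are invented, and the SBERT step is shown as a commented-out call (it would require the sentence-transformers package and the all-MiniLM-L6-v2 model named above).

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical string-typed job features; illustrative only, not drawn
# from the paper's 169-feature Slurm dataset.
jobs = np.array(["gpu_train", "cpu_sim", "gpu_train", "io_bench"])

# Traditional integer label encoding: one arbitrary integer per category,
# which imposes a meaningless ordering on the categories.
label_enc = LabelEncoder().fit_transform(jobs)  # shape (4,)

# One-hot encoding: one binary column per distinct category.
onehot = OneHotEncoder().fit_transform(jobs.reshape(-1, 1)).toarray()  # shape (4, 3)

# SBERT-style semantic encoding (sketch; requires sentence-transformers):
# from sentence_transformers import SentenceTransformer
# emb = SentenceTransformer("all-MiniLM-L6-v2").encode(jobs.tolist())
# `emb` would be a dense (4, 384) array capturing textual similarity
# between feature strings, which a regressor then consumes as input.
```

Either encoded array can be passed to a standard regressor (e.g. scikit-learn's Ridge or GradientBoostingRegressor) as the feature matrix for runtime prediction.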
Submission Number: 24