Leveraging Large Language Models to Enhance Machine-Learning-Driven HPC Job Scheduling

NeurIPS 2025 Workshop MLForSys

Published: 30 Oct 2025, Last Modified: 15 Nov 2025
License: CC BY 4.0
Keywords: Large language models, HPC, job scheduling, runtime prediction
Abstract: High-performance computing (HPC) systems rely on job schedulers like Slurm to allocate compute resources to submitted workloads. Recently, machine learning models have been used to predict job runtimes which can be used by schedulers to optimize utilization. However, many of these models struggle to effectively encode string-type job features, typically relying on integer-based Label or One-hot encoding methods. In this paper, we use Transformer-based large language models, particularly Sentence-BERT (SBERT), to semantically encode job features for regression-based job runtime prediction. Using a 90,000-record 169-feature Slurm dataset we evaluate four SBERT variants and compare them against traditional encodings using four regression models. Our results show that SBERT-based encodings—especially using the all-MiniLM-L6-v2 model—substantially outperform conventional methods, achieving an r2 score up to 0.88; 2.3× higher than traditionally-used Label encoding. Moreover, we highlight practical trade-offs, such as model memory size versus accuracy, to guide the selection of efficient encoders for production HPC systems.
Submission Number: 24
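
As a concrete illustration of the pipeline the abstract describes, the following is a minimal sketch: string-type Slurm job features are semantically encoded with SBERT, and a regressor is fit on the embeddings to predict runtime. The job strings, the feature layout, and the choice of scikit-learn's RandomForestRegressor are illustrative assumptions, not the paper's dataset or its four regressors; only the all-MiniLM-L6-v2 encoder is named in the abstract.

```python
# Sketch only (see assumptions above): SBERT embeddings of string-type
# job features, fed to a regressor for runtime prediction.
from sentence_transformers import SentenceTransformer
from sklearn.ensemble import RandomForestRegressor
import numpy as np

# Hypothetical string-type job features (job name, partition, QoS, account).
jobs = [
    "name=lammps_md partition=gpu qos=normal account=chem",
    "name=cfd_solver partition=cpu qos=high account=aero",
    "name=genome_align partition=bigmem qos=normal account=bio",
]
runtimes = np.array([3600.0, 7200.0, 1800.0])  # observed runtimes (seconds)

# Semantically encode each job's concatenated string features into a
# dense 384-dimensional vector.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
X = encoder.encode(jobs)  # shape: (len(jobs), 384)

# Fit a regressor on the embeddings; in a real pipeline these would be
# concatenated with the dataset's numeric features and trained at scale.
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, runtimes)

# Predict runtime for an unseen job described only by its string features.
new_job = encoder.encode(["name=lammps_md partition=gpu qos=high account=chem"])
print(f"predicted runtime: {model.predict(new_job)[0]:.0f} s")
```

In a production setting, the embeddings would be combined with the numeric Slurm features, and the memory-versus-accuracy trade-off the abstract highlights would drive the choice among SBERT variants.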