Surrogate Modeling for HPC Application Iteration Times Forecasting with Network Features

Published: 01 Jan 2024, Last Modified: 07 Feb 2025. Venue: SIGSIM-PADS 2024. License: CC BY-SA 4.0
Abstract: Interconnect networks are the foundation of modern high-performance computing (HPC) systems. Parallel discrete event simulation (PDES) is a cornerstone of the study of large-scale networking systems because it models and simulates the real-world behavior of HPC facilities, but its computational cost is escalating to an unsustainable scale. The research community is therefore interested in building a surrogate-ready PDES framework in which an accurate surrogate model forecasts HPC behavior and replaces computationally expensive PDES phases. In this paper, we focus on forecasting application iteration times, the key indicator of large-scale networking performance, using network features such as the bandwidth consumed and the busy time on routers. We consider five representative methods, namely LAST, Average, ARIMA, LSTM, and the proposed framework LSTM-Feat, for forecasting the iteration times of an exemplar application, MILC, running on a dragonfly system. By incorporating network features, LSTM-Feat can learn the dependencies between network features and iteration times, which facilitates forecasting. The experiments demonstrate the effectiveness of incorporating network features into surrogate models and the potential of surrogate models to accelerate PDES.
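To make the two simplest baselines named in the abstract concrete, the following is a minimal Python sketch. It rests on assumptions the abstract does not confirm: LAST is taken to repeat the most recent observed iteration time, and Average is taken to be the mean of a trailing window. The function names, the `window` and `warmup` parameters, and the sample values are hypothetical and are not taken from the paper.

```python
from statistics import mean
from typing import Callable, List, Sequence


def forecast_last(history: Sequence[float]) -> float:
    """LAST baseline (assumed definition): repeat the most recent iteration time."""
    if not history:
        raise ValueError("history must not be empty")
    return history[-1]


def forecast_average(history: Sequence[float], window: int = 3) -> float:
    """Average baseline (assumed definition): mean of the last `window` values."""
    if not history:
        raise ValueError("history must not be empty")
    if window < 1:
        raise ValueError("window must be at least 1")
    return mean(history[-window:])


def rolling_forecasts(
    series: Sequence[float],
    forecaster: Callable[[Sequence[float]], float],
    warmup: int,
) -> List[float]:
    """One-step-ahead forecasts for series[warmup:], each using only earlier values."""
    return [forecaster(series[:t]) for t in range(warmup, len(series))]


def mean_absolute_error(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Average absolute difference between observed and forecast values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return mean(abs(a - p) for a, p in zip(actual, predicted))


# Made-up iteration times (arbitrary units), purely illustrative.
iteration_times = [10.0, 12.0, 11.0, 13.0, 12.0]

last_predictions = rolling_forecasts(iteration_times, forecast_last, warmup=3)
last_error = mean_absolute_error(iteration_times[3:], last_predictions)
```

Baselines of this kind use only the iteration-time history. A feature-aware model such as the LSTM-Feat framework described above would additionally take network features as input, which this sketch does not attempt to reproduce.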