Fine-grained Trace-driven Performance Modeling and Simulation for Large-scale ML Training

Published: 30 May 2024, Last Modified: 23 Jun 2024
Venue: MLArchSys 2024 OralPoster
License: CC BY 4.0
Workshop Track: System for Machine Learning
Presentation: Virtual
Keywords: Performance Modeling, Distributed Training, Simulation, System for Machine Learning
Presenter Full Name: Mingyu Liang
TL;DR: Fine-grained trace-driven performance modeling and simulation framework for large-scale ML training
Presenter Email: ml2585@cornell.edu
Abstract: The widespread adoption of machine learning (ML) models, especially with the emergence of large language models (LLMs), has introduced growing challenges in understanding and optimizing both the models and their deployment systems. Performance modeling plays an essential role in predicting model performance across various scenarios and in guiding optimization techniques. In this work, we present TraceSim, a fine-grained, trace-driven performance modeling and simulation framework. TraceSim captures runtime details of models without any model-specific instrumentation and constructs a comprehensive execution graph to describe model execution. Evaluation with GPT-3 on a production-scale cluster of 256 GPUs achieves, on average, 95.6% accuracy in reproducing the end-to-end execution time, and up to 99.5% accuracy in predicting performance for unseen scaled-up scenarios.
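Illustrative sketch (not from the paper): the abstract describes replaying an execution graph built from traces to estimate end-to-end time. The Python below shows one generic way such a replay could work. It is not TraceSim's implementation. The node names, durations, the two streams ("compute" and "comm"), and the rule that each stream runs one operation at a time in the order operations become ready are all assumptions made for illustration.

```python
from collections import deque


def simulate(nodes):
    """Replay a dependency graph of operations and return (finish_times, makespan).

    nodes maps a name to (duration, stream, deps). All names and numbers are
    hypothetical. Assumption: each stream executes one operation at a time,
    in the order operations become ready.
    """
    indegree = {name: len(spec[2]) for name, spec in nodes.items()}
    children = {name: [] for name in nodes}
    for name, (_, _, deps) in nodes.items():
        for dep in deps:
            children[dep].append(name)

    ready = deque(name for name in nodes if indegree[name] == 0)
    finish = {}       # operation name -> simulated finish time
    stream_free = {}  # stream name -> time at which the stream becomes idle

    while ready:
        name = ready.popleft()
        duration, stream, deps = nodes[name]
        # An operation starts once its stream is idle and all inputs are done.
        start = max([stream_free.get(stream, 0.0)] + [finish[d] for d in deps])
        finish[name] = start + duration
        stream_free[stream] = finish[name]
        for child in children[name]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)

    if len(finish) != len(nodes):
        raise ValueError("execution graph contains a cycle")
    return finish, max(finish.values(), default=0.0)


# Hypothetical two-layer training step: the all-reduce of layer 2 overlaps
# with the backward pass of layer 1 because they run on different streams.
example = {
    "fwd":  (100, "compute", []),
    "bwd2": (100, "compute", ["fwd"]),
    "bwd1": (100, "compute", ["bwd2"]),
    "ar2":  (150, "comm", ["bwd2"]),
    "ar1":  (150, "comm", ["bwd1"]),
    "opt":  (50, "compute", ["ar1", "ar2"]),
}
finish_times, makespan = simulate(example)
print(makespan)
```

In this toy graph the simulated step time is shorter than the sum of all durations because communication overlaps with computation. A real trace-driven simulator would take durations and dependencies from captured traces rather than hand-written values.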
Presenter Bio: Mingyu is a Ph.D. student at Cornell University. His research interests lie in cloud computing, ML systems, and computer architecture. More recently, he has focused on trace-based performance modeling, optimization, and benchmarking of ML models.
Paper Checklist Guidelines: I certify that all co-authors have validated the presented results and conclusions, and have read and commit to adhering to the Paper Checklist Guidelines, Call for Papers and Publication Ethics.
YouTube Link: https://youtu.be/vhYEhKq87lQ
Dataset Release: I certify that all co-authors commit to release the dataset and necessary scripts to reproduce the presented results.
Workshop Registration: Yes, at least one of the authors has registered for the workshop (Two-Day Registration at minimum).
Submission Number: 2