Towards a Standardized Representation for Deep Learning Collective Algorithms

Published: 30 May 2024, Last Modified: 07 Jun 2024 · MLArchSys 2024 Oral · CC BY 4.0
Workshop Track: Architecture 2.0
Presentation: In-Person
Keywords: Distributed Machine Learning, Collective Algorithm Representation, Collective Communication, Format Standardization
Presenter Full Name: Jinsun Yoo
TL;DR: We propose a standardized representation of collective algorithms in distributed ML to bridge the gap between workload information and the various collective algorithm tools.
Presenter Email: jinsun@gatech.edu
Abstract: The explosion of machine learning model sizes has led to execution on distributed clusters at very large scale. Many works have tried to optimize the production of collective algorithms and the execution of collective communication, which acts as a bottleneck in distributed machine learning. However, the lack of a standardized collective algorithm representation has hindered interoperability between workload representations, collective algorithm producers, and consumers. Because producers and consumers each use their own representation, co-optimizing collective communication with the rest of the workload has become difficult, and tool-specific conversions and modifications must be written for every pair of producing and consuming tools. In this paper, we propose a standardized workflow built around a common collective algorithm representation. Upstream producers and downstream consumers converge on a common format based on the Chakra Execution Trace, which is already used to represent distributed machine learning workloads. This common representation makes it possible to view collective communication at the same level as workload operations, decouples producer and consumer tools, enhances interoperability, and relieves users of the burden of targeting specific downstream implementations. We provide a proof-of-concept of this workflow by simulating collective algorithms generated with the MSCCLang domain-specific language through the ASTRA-sim distributed machine learning simulator under various network configurations.
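To make the idea of a producer-agnostic collective algorithm representation concrete, the sketch below expresses a ring all-reduce as a per-rank dependency graph of communication nodes. This is a minimal illustration only: the `CommNode` dataclass, its fields, and the `ring_allreduce` helper are hypothetical names chosen for this example and are not the actual Chakra Execution Trace schema (which the paper bases its format on), nor the MSCCLang or ASTRA-sim APIs.

```python
# Hypothetical sketch of a graph-style collective algorithm representation.
# NOT the Chakra Execution Trace schema; field names are illustrative only.
from dataclasses import dataclass, field
from typing import List

@dataclass
class CommNode:
    node_id: int
    op: str          # "SEND", "RECV_REDUCE", or "RECV_COPY"
    peer: int        # rank of the communication partner
    chunk: int       # index of the data chunk this step moves
    deps: List[int] = field(default_factory=list)  # node ids that must finish first

def ring_allreduce(rank: int, world_size: int) -> List[CommNode]:
    """Emit one rank's steps of a ring all-reduce:
    (world_size - 1) reduce-scatter steps, then (world_size - 1) all-gather steps."""
    nodes: List[CommNode] = []
    nid, prev = 0, None
    # Reduce-scatter phase: circulate and accumulate partial sums.
    for step in range(world_size - 1):
        send = CommNode(nid, "SEND", (rank + 1) % world_size,
                        (rank - step) % world_size,
                        [prev] if prev is not None else [])
        recv = CommNode(nid + 1, "RECV_REDUCE", (rank - 1) % world_size,
                        (rank - step - 1) % world_size, [send.node_id])
        nodes += [send, recv]
        prev, nid = recv.node_id, nid + 2
    # All-gather phase: circulate the fully reduced chunks.
    for step in range(world_size - 1):
        send = CommNode(nid, "SEND", (rank + 1) % world_size,
                        (rank - step + 1) % world_size, [prev])
        recv = CommNode(nid + 1, "RECV_COPY", (rank - 1) % world_size,
                        (rank - step) % world_size, [send.node_id])
        nodes += [send, recv]
        prev, nid = recv.node_id, nid + 2
    return nodes

if __name__ == "__main__":
    for node in ring_allreduce(rank=0, world_size=4):
        print(node)
```

In such a workflow, a producer (e.g., an MSCCLang-style tool) would emit node graphs like this, and a consumer (e.g., a simulator such as ASTRA-sim) would schedule the nodes by honoring their dependencies, without either side knowing about the other's internals.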
Presenter Bio: Jinsun Yoo is a 3rd-year PhD student at Georgia Tech, advised by Profs. Kishore Ramachandran and Tushar Krishna. His research focuses on system support for novel computing paradigms, spanning distributed ML and edge computing.
Paper Checklist Guidelines: I certify that all co-authors have validated the presented results and conclusions, and have read and commit to adhering to the Paper Checklist Guidelines, Call for Papers and Publication Ethics.
Workshop Registration: Yes, at least one of the authors has registered for the workshop (Two-Day Registration at minimum).
Submission Number: 15