MoE-ERAS: Expert Residency Aware Selection

Abhimanyu Rajeshkumar Bambhaniya; Sashankh Chengavalli Kumar; Tushar Krishna

Published: 30 May 2024, Last Modified: 08 Jun 2024, Venue: MLArchSys 2024 OralPoster, License: CC BY 4.0

Workshop Track: System for Machine Learning

Presentation: Virtual

Keywords: LLMs, MoE, System for LLMs

Presenter Full Name: Abhimanyu Bambhaniya, Sashankh Kumar

TL;DR: Residency-aware routing for the accuracy-performance trade-off in MoE models.

Presenter Email: abambhaniya3@gatech.edu

Abstract: Mixture of Experts (MoE) models have quickly grown in popularity because they offer faster inference and training than dense models of similar capability. Parameter compression and offloading allow users to run these models with less GPU memory, which reduces cost. However, unpredictable expert activation slows inference whenever a required expert has been offloaded. In this work, we profile and study the expert activation patterns when running large MoE models. Based on insights from these patterns, we propose a new way of selecting experts that takes expert residency into account. We introduce MoE-ERAS (Expert Residency Aware Selection), which selects the most suitable experts considering both performance and accuracy. We show substantial reductions in decoding latency and in the number of expert swaps, and present analysis showing pre-fetching opportunities for future work. MoE-ERAS allows users to choose an acceptable point on the speedup-quality trade-off.
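
To make the idea of residency-aware selection concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify how residency enters the selection, so this example assumes one simple possibility: a hypothetical `residency_bias` is added to the router probability of every expert already resident in GPU memory before the top-k choice is made. The function name `select_experts`, its parameters, and the example logits are all illustrative assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_experts(router_logits, resident, k=2, residency_bias=0.0):
    """Pick k experts for one token.

    ASSUMPTION (not from the paper): residency is folded in by adding
    `residency_bias` to the router probability of each expert whose
    weights are already in GPU memory. A bias of 0.0 reduces to plain
    top-k routing; a larger bias trades routing fidelity for fewer swaps.
    """
    probs = softmax(router_logits)
    scores = [
        p + (residency_bias if i in resident else 0.0)
        for i, p in enumerate(probs)
    ]
    # Rank experts by biased score, highest first, and keep the top k.
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Mixing weights use the original (unbiased) probabilities, renormalised.
    z = sum(probs[i] for i in chosen)
    weights = {i: probs[i] / z for i in chosen}
    # Each chosen expert that is not resident would have to be loaded.
    swaps = sum(1 for i in chosen if i not in resident)
    return chosen, weights, swaps


if __name__ == "__main__":
    logits = [2.0, 1.9, 0.5, 1.8]   # illustrative router outputs for 4 experts
    resident = {1, 3}               # experts currently held in GPU memory
    for bias in (0.0, 0.1):
        chosen, weights, swaps = select_experts(logits, resident, k=2, residency_bias=bias)
        print(f"bias={bias}: experts={chosen}, swaps={swaps}")
```

In this toy setting, plain top-2 routing picks one offloaded expert, while a small bias shifts the choice to two resident experts whose router scores are only slightly lower. Sweeping such a knob is one way a user could pick a point on the speedup-quality trade-off described above.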

Presenter Bio: Abhimanyu Bambhaniya is a third-year PhD student at Georgia Tech. His primary area of research is "Algorithmic Optimization and System Design of LLM inference". He is working with collaborators from Google and Intel on projects for LLM inference acceleration, sparse training recipes, and sparse hardware architecture. Sashankh Chengavalli Kumar is an MS Computer Science student at Georgia Tech with an interest in systems for Machine Learning. He completed his undergraduate degree in Computer Science at the National University of Singapore. He is currently a Machine Learning Engineer Intern at a Santa Clara startup.

Paper Checklist Guidelines: I certify that all co-authors have validated the presented results and conclusions, and have read and commit to adhering to the Paper Checklist Guidelines, Call for Papers and Publication Ethics.

Dataset Release: I certify that all co-authors commit to release the dataset and necessary scripts to reproduce the presented results.

Workshop Registration: Yes, at least one of the authors has registered for the workshop (Two-Day Registration at minimum).

Submission Number: 18
