Abstract: Deploying large-scale LLM training and inference with optimal performance is exceptionally challenging because of the complex design space of parallelism strategies, system optimizations, and hardware configurations. Accurate and rapid performance simulation is critical for guiding optimization efforts and system studies by validating “what-if” hypotheses. To address this need, we introduce Charon, a unified, modular, and fine-grained simulator for accurately predicting LLM performance. Experiments show that Charon achieves high accuracy across different models and configurations, with an overall prediction error consistently under 5.35%, and under 3.74% for training on a large-scale GPU cluster. In a practical inference deployment case, Charon discovered a configuration that improved system throughput over an engineering-tuned baseline, demonstrating its real-world value.
Topics: Benchmarks, Datasets, and Evaluation: Benchmarks for training, inference, and efficiency; Benchmarks, Datasets, and Evaluation: Testing, debugging, monitoring, and reproducibility of ML applications; Benchmarks, Datasets, and Evaluation: Visualization of data, models, and predictions; Model Serving: System optimizations for model serving
Submission Number: 8