Malleus: Straggler-Resilient Hybrid Parallel Training of Large-scale Models via Malleable Data and Model Parallelization

Haoyang Li, Fangcheng Fu, Hao Ge, Sheng Lin, Xuanyu Wang, Jiawen Niu, Yujie Wang, Hailin Zhang, Xiaonan Nie, Bin Cui

Published: 17 Jun 2025, Last Modified: 25 Jan 2026 · Proceedings of the ACM on Management of Data · CC BY-SA 4.0
External IDs: doi:10.1145/3725322