DITRON: A Flexible and Versatile Distributed Tensor Compiler for LLM

17 Sept 2025 (modified: 19 Oct 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Distributed Systems; Machine Learning Compilers; Compute-Communication Overlapping
Abstract: As the performance scaling of individual devices slows down, acceleration through distributed systems has become the mainstream approach. With the increasing maturity of parallel optimizations (e.g., tensor parallelism, sequence parallelism, expert parallelism) in recent years, researchers have found that the key to scalability lies in optimizing the overlap between computation and communication. Developing sophisticated overlapping kernels is challenging and exceeds the capabilities of most researchers, hindering the development of new model architectures and parallel strategies. To address this issue, we present DITRON, a flexible and versatile distributed compiler that offers high-level programming interfaces for overlapping kernels. DITRON provides programming abstractions at three levels: (1) fine-grained tile-level programming within the scale-up domain; (2) chunk-level data transfer for the scale-out domain; and (3) task-level distributed MegaKernel generation for an entire LLM. DITRON inherits Triton's programming model, so existing Triton kernels can be transformed into parallel, overlapping kernels with minimal changes to their source code. Evaluation results show that overlapping kernels developed with DITRON are $1.27\times$–$19.18\times$ faster than non-overlapping versions and even outperform expert-tuned CUDA libraries by 6%–30%. End-to-end inference yields a 5%–30% speedup over vLLM. Moreover, DITRON has been validated on training tasks spanning more than 10,000 GPUs, demonstrating robust capability for large-scale industrial deployment and saving millions of GPU hours monthly.
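As a concrete point of reference, the minimal sketch below shows an ordinary single-device Triton kernel of the kind the abstract says DITRON can turn into an overlapping distributed kernel with only minor source changes. Everything here is standard Triton (`triton.jit`, `tl.load`, `tl.store`); DITRON's own annotations and communication primitives are not reproduced because the abstract does not specify them, so this illustrates only the baseline programming model being inherited, not DITRON's API.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance owns one tile of BLOCK_SIZE elements; this
    # tile-granular structure is what a tile-level distributed abstraction
    # would reuse when interleaving computation with communication.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)


def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # Plain single-GPU launch; a DITRON-style compiler would (per the
    # abstract) generate the distributed, overlapping variant from a kernel
    # written in this style -- the exact interface is not shown in the source.
    out = torch.empty_like(x)
    n = out.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```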
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Submission Number: 9073