ParallelKernelBench: Can LLMs Write Fast Multi-GPU Kernels?

Willy Chan; Nathan Paek; Simon Guo; Simran Arora; Daniel Y Fu

Published: 16 Jun 2026, Last Modified: 16 Jun 2026

Venue: ICML 2026 Workshop DL4C

License: CC BY-NC 4.0

Keywords: Benchmark, Parallelism, GPU Kernel Design, Code Generation, Distributed Programming

TL;DR: Evaluating language models’ ability to generate multi-GPU kernels on 87 workloads covering parallelism strategies across the training and inference stack.

Abstract: There is growing interest in using large language models (LLMs) to write high-performance GPU kernels, with recent work showing promising results on single-GPU workloads. However, multi-GPU kernel generation remains unexplored, despite communication emerging as a dominant bottleneck in large-scale training and inference. In this paper, we study how well LLMs can write multi-GPU kernels, a task complicated by three challenges: (1) the design space is combinatorially large, as training and inference workloads can be parallelized across tensor, expert, pipeline, data, and sequence dimensions; (2) single-GPU memory-compute roofline analysis fails to capture communication bottlenecks in multi-GPU execution; and (3) the many hardware paths available for communication (e.g., copy engine, TMA, SM instructions) each carry distinct tradeoffs. We introduce ParallelKernelBench (PKB), a benchmark and evaluation framework for multi-GPU kernel generation. Additionally, we construct a taxonomy of distributed workloads spanning different parallelism types and select 87 problems covering compositions that arise in real workloads. We contribute evaluations of frontier coding models on PKB, finding that current LLMs struggle: single-shot kernel correctness plateaus at 32% of cases, with speedup exceeding an unoverlapped baseline (PyTorch + NCCL) in only 25% of cases. We also contribute a communication-aware roofline analysis of correct kernels, finding that over 90% of baselines achieve less than 50% of peak hardware utilization. Optimized solutions to many PKB workloads are largely absent from existing open-source repositories; we highlight several LLM-generated, net-new kernels that outperform their reference implementations in specific regimes, including NeMo's vocab-parallel log-probability kernel (up to 2.06x), Hyena CP (up to 1.72x), and SAM3 IoU suppression (up to 1.40x).
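Illustrative sketch (not from the paper): one common way to make a roofline "communication-aware" is to add an interconnect term next to the usual compute and memory terms, and to take the slowest of the three as a lower bound on runtime. The abstract does not specify the model PKB uses, so the function names, the three-term formulation, and the perfect-overlap assumption below are all hypothetical, and the numbers in the usage example are made up for illustration.

```python
def roofline_bound(flops, mem_bytes, comm_bytes, peak_flops, mem_bw, link_bw):
    """Return (lower-bound time in seconds, name of the limiting resource).

    Assumption: compute, memory traffic, and inter-GPU communication
    overlap perfectly, so the slowest of the three sets the bound.
    """
    times = {
        "compute": flops / peak_flops,        # FLOPs / (FLOP/s)
        "memory": mem_bytes / mem_bw,         # bytes / (bytes/s)
        "communication": comm_bytes / link_bw,  # bytes / (bytes/s)
    }
    limiter = max(times, key=times.get)
    return times[limiter], limiter


def utilization(measured_time, bound_time):
    """Fraction of the roofline bound achieved by a measured kernel time."""
    return bound_time / measured_time


# Hypothetical workload and hardware numbers, chosen only for illustration.
bound, limiter = roofline_bound(
    flops=1e12, mem_bytes=1e9, comm_bytes=1e9,
    peak_flops=1e15, mem_bw=1e12, link_bw=1e11,
)
achieved = utilization(measured_time=4e-2, bound_time=bound)
print(limiter, bound, achieved)
```

Under this sketch, a kernel whose measured time is four times the bound would count as reaching 25% of peak, which is the kind of quantity the abstract's "less than 50% of peak hardware utilization" statement refers to.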

Submission Number: 46
