KernelBench: Can LLMs Write Efficient GPU Kernels?

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: A benchmark evaluating language models’ ability to generate correct and fast GPU kernels
Abstract: Efficient GPU kernels are crucial for building performant machine learning architectures, but writing them is a time-consuming challenge that requires significant expertise; therefore, we explore using language models (LMs) to automate kernel generation. We introduce **KernelBench**, an open-source framework for evaluating LMs' ability to write fast and correct kernels on a suite of 250 carefully selected PyTorch ML workloads. KernelBench represents a real-world engineering environment, and progress on the benchmark directly translates to faster practical kernels. We introduce a new evaluation metric, $\text{fast}_p$, which measures the percentage of generated kernels that are functionally correct and offer a speedup greater than an adjustable threshold $p$ over the baseline. Our experiments across various state-of-the-art models and test-time methods show that frontier reasoning models perform the best out of the box but still fall short overall, matching the PyTorch baseline in less than 20\% of cases. While we show that results can improve by leveraging execution and profiling feedback during iterative refinement, KernelBench remains a challenging benchmark, and its difficulty increases as we raise the speedup threshold $p$.
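The $\text{fast}_p$ metric described in the abstract can be sketched in a few lines. This is an illustrative reconstruction, not the benchmark's actual implementation; the data-class fields and function name are assumptions made for clarity.

```python
# Hedged sketch of the fast_p metric: the fraction of generated kernels
# that are both functionally correct and achieve a speedup > p over the
# baseline. Field and function names are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class KernelResult:
    correct: bool    # passes functional-correctness checks vs. the baseline
    speedup: float   # baseline_time / generated_kernel_time

def fast_p(results: list[KernelResult], p: float) -> float:
    """Fraction of kernels that are correct AND faster than threshold p."""
    if not results:
        return 0.0
    wins = sum(1 for r in results if r.correct and r.speedup > p)
    return wins / len(results)

# At p = 1.0, fast_1 counts kernels that simply beat the PyTorch baseline.
results = [
    KernelResult(correct=True,  speedup=1.4),
    KernelResult(correct=True,  speedup=0.8),   # correct but slower
    KernelResult(correct=False, speedup=2.0),   # fast but wrong: excluded
    KernelResult(correct=True,  speedup=1.1),
]
print(fast_p(results, p=1.0))  # → 0.5
```

Raising $p$ tightens the success criterion, which is why the paper reports that the benchmark's difficulty increases with the threshold.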
Lay Summary: Modern AI systems demand massive computational power, delivered by dedicated AI hardware such as GPUs. To effectively utilize this hardware, engineers need to write specialized programs called GPU kernels, but the process of developing kernels is extremely difficult and time-consuming due to the deep domain knowledge required. We investigate whether language models could help automatically generate these complex GPU kernels. To test this, we created KernelBench, a comprehensive benchmark of 250 real-world AI workloads that could be accelerated using performant kernels. We found that today’s models struggle significantly with this task, with the best models only matching PyTorch's performance in less than 20% of cases. Just as human expert engineers iteratively refine their code over time, we found that leveraging execution feedback could also help AI improve its generated kernels; however, improvement remains limited and writing efficient GPU kernels still poses a challenge for current AI systems. Progress on KernelBench directly translates to faster, more efficient kernels that could reduce energy consumption and accelerate AI development. Additionally, KernelBench serves as a research environment for improving language models on this challenging and performance-critical code generation task.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/ScalingIntelligence/KernelBench
Primary Area: General Machine Learning->Evaluation
Keywords: Benchmark, GPU Kernel Design, Code Generation
Submission Number: 14542