Keywords: CUDA Optimization, Reinforcement Learning, LLMs
Abstract: The exponential growth in demand for GPU computing resources has created an urgent need for automated CUDA optimization strategies.
While recent advances in LLMs show promise for code generation, current state-of-the-art models achieve low success rates at improving the execution speed of CUDA kernels. In this paper, we introduce CUDA-L1, an automated reinforcement learning (RL) framework for CUDA optimization that employs a novel contrastive RL algorithm.
CUDA-L1 achieves significant performance improvements on the CUDA optimization task: trained on NVIDIA A100, it delivers an average speedup of {\bf ×3.12} with a median speedup of {\bf ×1.42} against the default baselines across all 250 CUDA kernels of KernelBench, with peak speedups reaching {\bf ×120}. Beyond the default baseline provided by KernelBench, CUDA-L1 demonstrates speedups of {\bf ×2.77} over Torch Compile, {\bf ×2.88} over Torch Compile in reduce-overhead mode, and {\bf ×2.81} over CUDA Graph implementations. Furthermore, the model demonstrates portability across GPU architectures, achieving average speedups of {\bf ×3.85} (median {\bf ×1.32}) on H100, {\bf ×3.13} (median {\bf ×1.31}) on L40, {\bf ×2.51} (median {\bf ×1.18}) on RTX 3090, and {\bf ×2.38} (median {\bf ×1.34}) on H20, despite being optimized specifically for A100.
Beyond these benchmark results, CUDA-L1 exhibits several notable properties: 1) it discovers a variety of CUDA optimization techniques and learns to combine them strategically to achieve optimal performance; 2) it uncovers fundamental principles of CUDA optimization, such as the multiplicative nature of optimizations (e.g., independent ×2 and ×1.5 speedups compound to roughly ×3); 3) it identifies non-obvious performance bottlenecks and rejects seemingly beneficial optimizations that actually harm performance.
These capabilities demonstrate that RL can transform an initially poor-performing LLM into an effective CUDA optimizer through speedup-based reward signals alone, without human expertise or domain knowledge. In the process, the model identifies CUDA optimization patterns, discovers new techniques, synthesizes them to achieve speedups, and, more importantly, extends the acquired reasoning abilities to unseen kernels.
This paradigm opens possibilities for the automated optimization of CUDA operations and holds promise for substantially improving GPU efficiency and alleviating the rising pressure on GPU computing resources.
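To make the speedup-based reward concrete, below is a minimal sketch of how such a reward could be computed for a candidate kernel. The names (`speedup_reward`, `ref_fn`, `cand_fn`) and the timing protocol are illustrative assumptions rather than the paper's actual implementation; only the idea of rewarding the measured runtime ratio over a baseline comes from the abstract.

```python
import torch

def speedup_reward(ref_fn, cand_fn, *inputs, warmup=10, iters=100):
    """Hypothetical RL reward: ratio of baseline runtime to candidate runtime.

    ref_fn / cand_fn wrap the reference kernel and the LLM-generated
    candidate; a reward > 1 means the candidate is faster.
    """
    def time_fn(fn):
        # Warm up to exclude one-time costs (caching, lazy initialization).
        for _ in range(warmup):
            fn(*inputs)
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        torch.cuda.synchronize()
        start.record()
        for _ in range(iters):
            fn(*inputs)
        end.record()
        torch.cuda.synchronize()
        return start.elapsed_time(end) / iters  # mean ms per call

    return time_fn(ref_fn) / time_fn(cand_fn)
```

In practice such a reward would also need a correctness check (comparing candidate outputs against the reference) before timing, so that invalid kernels receive no speedup credit.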
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 14728