DSL-Monkeys: Self-Generated In-Context Examples for Low-Resource GPU DSL Kernels
Abstract: Domain-specific languages (DSLs) for GPU kernels such as TileLang and ThunderKittens remain challenging for LLMs due to scarce training data and limited documentation. Standard test-time techniques (repeated sampling, iterative refinement, evolutionary search) treat each kernel independently, missing a key insight: successful implementations encode reusable DSL usage patterns that apply directly to new tasks. We introduce DSL-Monkeys, a batched test-time approach that considers an entire collection of kernels jointly rather than individually. DSL-Monkeys maintains a growing archive of successfully implemented kernels and retrieves them as in-context demonstrations for the remaining tasks, resembling a form of test-time reinforcement learning: it improves on future tasks by curating a retrieval database rather than updating weights. First, starting from just 3 seed examples, DSL-Monkeys achieves 71% and 18% correctness on 200 KernelBench Level 1-2 tasks for TileLang and ThunderKittens respectively, compared to baseline pass@1 rates of <20% and <1%. Second, with curriculum decomposition, DSL-Monkeys solves complex linear attention variants that approach or match the performance of human-expert-written Triton kernels. Third, we scale DSL-Monkeys to 10K PyTorch programs from GitHub, producing thousands of verified DSL implementations that address DSL data scarcity. We show that fine-tuning on both distilled and self-generated data improves Qwen's TileLang implementation ability, a potential path toward improving LLMs' code generation with emerging DSLs.
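The archive-and-retrieve loop the abstract describes can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: `generate_kernel` stands in for the LLM call (here a stub whose success odds grow with the number of demonstrations), `verify` stands in for the correctness check, and demonstrations are drawn by random sampling rather than whatever retrieval strategy the paper uses.

```python
import random

def generate_kernel(task, demos):
    # Stub for the LLM call: a real system would prompt the model with
    # `demos` as in-context examples. Here, more demos -> higher success
    # probability, mimicking the benefit of a richer archive.
    p = min(0.1 + 0.2 * len(demos), 0.9)
    return (task, random.random() < p)

def verify(candidate):
    # Stub for the correctness check (compile + numerical comparison
    # against a reference in a real system).
    _, ok = candidate
    return ok

def dsl_monkeys(tasks, seed_archive, k=3, attempts=4):
    """Grow an archive of verified kernels; reuse it as demonstrations."""
    archive = list(seed_archive)  # verified (task, kernel) entries
    solved = {}
    for task in tasks:
        for _ in range(attempts):
            demos = random.sample(archive, min(k, len(archive)))
            cand = generate_kernel(task, demos)
            if verify(cand):
                archive.append(cand)   # curate the retrieval database
                solved[task] = cand
                break
    return solved, archive
```

The key design choice the abstract highlights is that tasks are not independent: every verified solution is appended to `archive`, so later tasks see better demonstrations, which is why the method improves without any weight updates.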
Submission Number: 59