Keywords: Kernel Fusion, GPU Scheduling, Systems for ML, GPU Architecture
TL;DR: Kernel fusion is bottlenecked by on-GPU scheduling. We offload scheduling to a tightly integrated CPU "co-pilot" using cache-coherent links, significantly improving kernel fusion performance.
Abstract: Executing modern ML workloads as sequences of discrete GPU kernels leads to significant hardware underutilization because of kernel launch, data movement, and CPU-GPU synchronization overheads. Recent advances in kernel fusion reduce small-kernel launch overhead by consolidating many small kernels into a single, persistent kernel. However, existing fusion techniques delegate complex scheduling logic to the GPU itself—a task for which its architecture is ill-suited. This on-GPU scheduling creates critical inefficiencies: its control-intensive, synchronization-heavy logic is fundamentally mismatched with the GPU's parallel microarchitecture, leading to threads stalled on synchronization and high-overhead collection of global state.
We propose CCKS (Cooperative Coherent Kernel Scheduler), a novel framework that leverages tightly integrated, cache-coherent CPU-GPU architectures, such as the NVIDIA Grace Hopper Superchip, for fused kernel scheduling. CCKS offloads the scheduling of fused kernels to the host CPU, treating it as a dedicated co-processor. In our design, the GPU's role is simplified to that of an efficient information provider and decision executor. This division of labor is enabled by a near-zero-overhead, cache-coherent interface that exposes GPU runtime state and allows the CPU to make and propagate scheduling decisions asynchronously and concurrently with GPU execution. To facilitate our approach, we introduce a programming framework that automatically generates the requisite CPU scheduler and GPU code from a high-level description. Our evaluation shows that CCKS achieves up to 77% performance improvement over state-of-the-art kernel fusion frameworks on representative ML workloads.
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Submission Number: 23384