Decoupled Fallback: A Portable Single-Pass GPU Scan

Thomas Smith, Raph Levien, John D. Owens

Published: 2025, Last Modified: 05 May 2026. Venue: SPAA 2025. License: CC BY-SA 4.0
Abstract: We present Decoupled Fallback, a work-stealing method that enables single-pass Chained Scans to run on hardware without forward-progress guarantees (FPG) while avoiding starvation. Additionally, we introduce a tile state representation for Chained Scans that does not rely on 64-bit atomics or memory barriers for correctness, along with a subgroup-size-agnostic intra-workgroup implementation. On devices that lack FPG (Apple M1 Max and M3), Decoupled Fallback achieves near-Memcpy speeds for inclusive prefix sum using the native Dawn implementation of the WebGPU standard, and approaches the full theoretically expected 50% speedup over the slower Reduce-then-Scan approach. We further demonstrate the resilience of Decoupled Fallback against unfair schedulers by simulating blocking at rates of up to 50%, showing that it maintains superior performance over Reduce-then-Scan even under extreme contention.
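To make the "theoretically expected 50% speedup" concrete, the following is a minimal serial sketch, not the authors' implementation. It assumes the figure comes from counting global-memory traffic: a Reduce-then-Scan touches each element three times (read, read, write), while a single-pass scan touches it twice (read, write). The function names, the `tile` parameter, and the traffic counters are hypothetical and exist only for illustration. The model ignores tile-state traffic, look-back, and the fallback mechanism entirely.

```python
from itertools import accumulate


def reduce_then_scan(data, tile):
    """Two-pass tiled inclusive prefix sum (serial model).

    Returns (result, traffic), where traffic counts element reads and writes.
    """
    traffic = 0
    starts = range(0, len(data), tile)

    # Pass 1: reduce every tile, reading each element once.
    tile_sums = []
    for start in starts:
        chunk = data[start:start + tile]
        traffic += len(chunk)
        tile_sums.append(sum(chunk))

    # Exclusive scan of the per-tile sums (small; its traffic is ignored here).
    prefixes = [0] + list(accumulate(tile_sums))[:-1]

    # Pass 2: re-read each tile, scan it with its prefix, and write the result.
    out = []
    for i, start in enumerate(starts):
        chunk = data[start:start + tile]
        traffic += len(chunk)  # second read of the input
        running = prefixes[i]
        for x in chunk:
            running += x
            out.append(running)
        traffic += len(chunk)  # write of the output
    return out, traffic


def single_pass_scan(data, tile):
    """Single-pass tiled inclusive prefix sum (serial model).

    `carry` stands in for the prefix a tile would obtain from its
    predecessors in a chained scan; here it is simply passed along in order.
    """
    traffic = 0
    out = []
    carry = 0
    for start in range(0, len(data), tile):
        chunk = data[start:start + tile]
        traffic += len(chunk)  # single read of the input
        running = carry
        for x in chunk:
            running += x
            out.append(running)
        traffic += len(chunk)  # write of the output
        carry = running
    return out, traffic


data = list(range(1, 17))
rts_out, rts_traffic = reduce_then_scan(data, tile=4)
sp_out, sp_traffic = single_pass_scan(data, tile=4)
ratio = rts_traffic / sp_traffic
```

Under this model both functions produce the same prefix sums, and the traffic ratio is 3n / 2n = 1.5 for a memory-bound scan. On real hardware the tiles run concurrently, which is why a single-pass scan needs a way for each tile to obtain its predecessors' prefix without stalling indefinitely.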