Abstract: We present Decoupled Fallback, a work-stealing method that enables single-pass Chained Scans to run on hardware without forward-progress guarantees (FPG) while avoiding starvation. Additionally, we introduce a tile state representation for Chained Scans that does not rely on 64-bit atomics or memory barriers for correctness, along with a subgroup-size-agnostic intra-workgroup implementation. On devices lacking FPG (Apple M1 Max and M3), Decoupled Fallback achieves near-Memcpy speeds for inclusive prefix sum using the native Dawn implementation of the WebGPU standard, and approaches the full theoretically expected 50% speedup over the slower Reduce-then-Scan approach. We further demonstrate the resilience of Decoupled Fallback against unfair schedulers by simulating blocking at rates of up to 50%, showing that it continues to outperform Reduce-then-Scan even under extreme contention.
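The 50% figure can be motivated by a rough count of global memory traffic. The sketch below is not taken from the abstract. It assumes that both methods are bandwidth-bound, that Reduce-then-Scan reads the input once to reduce and then reads and writes it again to scan, and that a single-pass Chained Scan reads and writes each element once, as a Memcpy does. The symbols n (element count) and T (traffic in element accesses) are introduced here for illustration only.

```latex
% Assumed traffic model: n elements, counts in element-sized global accesses.
\begin{align*}
T_{\text{Reduce-then-Scan}} &\approx \underbrace{n}_{\text{reduce: read}}
  + \underbrace{n + n}_{\text{scan: read + write}} = 3n, \\
T_{\text{Chained Scan}} &\approx \underbrace{n}_{\text{read}}
  + \underbrace{n}_{\text{write}} = 2n = T_{\text{Memcpy}}, \\
\text{speedup} &\approx \frac{T_{\text{Reduce-then-Scan}}}{T_{\text{Chained Scan}}}
  = \frac{3n}{2n} = 1.5 .
\end{align*}
```

Under these assumptions, a Chained Scan that runs at Memcpy speed would be about 1.5 times as fast as Reduce-then-Scan, which matches the 50% speedup the abstract calls theoretically expected.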
External IDs: dblp:conf/spaa/SmithLO25