A Parallel Scan Algorithm in the Tensor Core Unit Model

Published: 01 Jan 2023, Last Modified: 22 Mar 2025, Venue: Euro-Par 2023, License: CC BY-SA 4.0
Abstract: We present a parallel scan (prefix sum) algorithm in the Tensor Core Unit (TCU) model of computation. The TCU model assumes that multiplication of two square matrices of constant size s is a basic operation. In the \((s^2,\ell )\)-TCU model, where \(\ell \) denotes the latency of one such multiplication, we show that for inputs of size n, the algorithm has depth at most \(2\lfloor \log _s(n)\rfloor \) and runs in time \(\mathcal {O}(n(1+\ell /s^2) / p + (s^2 + \ell ) \log _s (n))\) given p tensor core units. Equivalently, the algorithm performs \(\mathcal {O}(n/s^2)\) multiplications of square matrices of size s.
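To make the idea of a scan built from small matrix products concrete, here is a minimal sequential sketch in Python. It is not the paper's algorithm: it only illustrates the general technique of computing local prefix sums by multiplying an s x s tile with an upper-triangular all-ones matrix and then recursing on the chunk totals. The names `upper_ones`, `matmul`, and `tcu_scan_sketch` are hypothetical, `matmul` merely stands in for one tensor-core operation, and the offsets are added with ordinary scalar additions, which is an assumption of this sketch.

```python
def upper_ones(s):
    """Return the s x s matrix U with U[k][j] = 1 if k <= j, else 0.

    For any s x s matrix A, (A @ U)[i][j] = A[i][0] + ... + A[i][j],
    so one product scans every row of A at once.
    """
    return [[1 if k <= j else 0 for j in range(s)] for k in range(s)]


def matmul(a, b):
    """Plain s x s matrix product (hypothetical stand-in for one TCU operation)."""
    s = len(b)
    return [[sum(row[k] * b[k][j] for k in range(s)) for j in range(s)]
            for row in a]


def tcu_scan_sketch(x, s):
    """Inclusive prefix sum of x, using s x s products for the local scans.

    Illustrative assumption: requires s >= 2; input is zero-padded so that
    it tiles into s x s matrices.
    """
    n = len(x)
    if n == 0:
        return []
    u = upper_ones(s)
    tile = s * s
    padded = list(x) + [0] * (-n % tile)

    # Step 1: scan each chunk of s consecutive elements, s chunks per product.
    local = []
    for start in range(0, len(padded), tile):
        a = [padded[start + r * s:start + (r + 1) * s] for r in range(s)]
        for row in matmul(a, u):
            local.extend(row)
    if n <= s:
        return local[:n]

    # Step 2: the last entry of each chunk is that chunk's total;
    # scan the totals recursively (the problem shrinks by a factor of s).
    totals = [local[i + s - 1] for i in range(0, len(padded), s)]
    scanned = tcu_scan_sketch(totals, s)

    # Step 3: add the sum of all preceding chunks to every element.
    offsets = [0] + scanned[:-1]
    return [local[i] + offsets[i // s] for i in range(n)]


print(tcu_scan_sketch([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 2))
# prints [1, 3, 6, 10, 15, 21, 28, 36, 45, 55]
```

In this sketch the top level issues about n/s^2 products and each recursive level handles roughly a factor of s fewer elements, so the total number of products forms a geometric series dominated by its first term. That matches the order of the \(\mathcal {O}(n/s^2)\) count stated in the abstract, although the sketch makes no claim about the depth or running-time bounds.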