Efficient Algorithms for the Summed Area Tables Primitive on GPUs

Published: 01 Jan 2018, Last Modified: 12 Nov 2025CLUSTER 2018EveryoneRevisionsBibTeXCC BY-SA 4.0
Abstract: Two-dimensional Summed Area Tables (SAT) is a fundamental primitive used in image processing and machine learning applications. We present a collection of optimization methods for computing SAT on CUDA-enabled GPUs. Conventional approaches rely on computing the prefix sum in one dimension in parallel, transposing the matrix, then computing the prefix sum for the other dimension in parallel. Additionally, conventional methods use the scratchpad memory as cache. We propose a collection of algorithms that are scalable with respect to problem size. We use the register cache technique instead of the scratchpad memory and also employ a naive serial scan on the thread level for computing the prefix sum for one of the dimensions. Using a novel transpose-in-registers method we increase the inter-thread parallelism and outperform conventional SAT implementations. In addition, we significantly reduce both the communication between threads and the number of arithmetic instructions. On an Nvidia Pascal P100 GPU and Volta V100, our evaluations demonstrate that our implementations outperform state of the art libraries and yield up to 2.3x and 3.2x speedup over OpenCV and Nvidia NPP libraries, respectively.
Loading