Register packing for cyclic reduction: a case study

Andrew A. Davidson; John D. Owens

Published: 01 Jan 2011, Last Modified: 06 Nov 2024, Venue: GPGPU 2011, License: CC BY-SA 4.0

Abstract: We generalize a method for avoiding GPU shared-memory communication in algorithms with a down-sweep pattern, and we apply this generalization to Cyclic Reduction (CR), a tridiagonal solver that exhibits this pattern. Previously, CR performed poorly compared to other GPU tridiagonal solvers because of shared-memory bandwidth bottlenecks and low step efficiency. We address these problems with our methodology for reducing down-sweep shared-memory communication. Our re-mapping also allows CR to solve larger systems directly in a virtual block. With our generalized mapping, CR runs 3-4.5x faster on a GPU than the original CR implementation, making it 1.5-3x faster than other GPU tridiagonal solvers.
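For readers unfamiliar with the solver named in the abstract, the following is a minimal serial sketch of textbook cyclic reduction in Python. It is not the authors' register-packed GPU implementation; it only illustrates the forward-reduction phase and the down-sweep (back-substitution) phase that the abstract refers to. The function name `cyclic_reduction`, the argument layout, and the restriction to systems of size 2^k - 1 are assumptions made for this illustration.

```python
def cyclic_reduction(a, b, c, d):
    """Solve a tridiagonal system with serial cyclic reduction.

    Row i reads: a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i],
    with a[0] and c[n-1] ignored (treated as zero).
    Assumption for this sketch: n = 2**k - 1, and no pivoting is needed
    (for example, the matrix is diagonally dominant).
    """
    n = len(b)
    if n == 0 or (n & (n + 1)) != 0:
        raise ValueError("this sketch requires n = 2**k - 1")

    # Work on copies so the caller's data is left untouched.
    a, b, c, d = list(a), list(b), list(c), list(d)
    a[0] = 0.0
    c[n - 1] = 0.0
    x = [0.0] * n

    # Forward reduction: at stride s, eliminate the neighbours at
    # distance s from every equation i with (i + 1) divisible by 2*s.
    s = 1
    while 2 * s - 1 < n:
        for i in range(2 * s - 1, n, 2 * s):
            lo, hi = i - s, i + s
            k1 = a[i] / b[lo]
            k2 = c[i] / b[hi]
            b[i] = b[i] - c[lo] * k1 - a[hi] * k2
            d[i] = d[i] - d[lo] * k1 - d[hi] * k2
            a[i] = -a[lo] * k1  # now couples x[i] to x[i - 2*s]
            c[i] = -c[hi] * k2  # now couples x[i] to x[i + 2*s]
        s *= 2

    # Down-sweep (back substitution): the first pass solves the single
    # remaining middle equation; each later pass halves the stride and
    # fills in the unknowns that sit between already-solved ones.
    while s >= 1:
        for i in range(s - 1, n, 2 * s):
            left = x[i - s] if i - s >= 0 else 0.0
            right = x[i + s] if i + s < n else 0.0
            x[i] = (d[i] - a[i] * left - c[i] * right) / b[i]
        s //= 2
    return x
```

In the down-sweep, every unknown computed at one stride depends on values produced at the previous, larger stride. On a GPU those values are typically exchanged between threads, which is the communication the abstract describes reducing.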
