Abstract: We generalize a method for avoiding GPU shared communication when dealing with a downsweep pattern. We apply this generalization to Cyclic Reduction, a tridiagonal solver with this pattern. Previously, Cyclic Reduction suffered poor performance when compared to other tridiagonal solvers on the GPU due to performance issues stemming from shared-memory bandwidth bottlenecks and step-efficiency. We address this problem by applying our down-sweep shared-memory communication-reducing methodology. Our re-mapping also allows Cyclic Reduction to solve larger systems directly in a virtual block. By using our generalized mapping, we improve Cyclic Reduction's performance on a GPU by a factor of 3-4.5x over the original CR implementation, making it 1.5-3x faster than other GPU tridiagonal solvers.