Locality-centric thread scheduling for bulk-synchronous programming models on CPU architectures

Hee-Seok Kim; Izzat El Hajj; John A. Stratton; Steven S. Lumetta; Wen-mei W. Hwu

Published: 01 Jan 2015, Last Modified: 14 Nov 2024, CGO 2015, CC BY-SA 4.0

Abstract: With heterogeneous computing on the rise, executing programs efficiently on different devices from a single source code has become increasingly important. OpenCL, having a bulk-synchronous programming model, has been proposed as a framework for writing such performance-portable programs. Execution order of work-items in a program is unconstrained except at barrier synchronization events, giving some freedom to an implementation when scheduling work-items between synchronization points. Many OpenCL (and CUDA) compilers have been designed for targeting multicore CPU architectures. However, scheduling work-items in prior work has been done with primary focus on correctness and vectorization. To the best of our knowledge, no existing implementations consider the impact of work-item scheduling on data locality. We propose an OpenCL compiler that performs data-locality-centric work-item scheduling. By analyzing the memory addresses accessed in loops within a kernel, our technique can make better decisions on how to schedule work-items to construct better memory access patterns, thereby improving performance. Our approach achieves geomean speedups of 3.32× over AMD's and 1.71× over Intel's implementations on Parboil and Rodinia benchmarks.
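
To make the scheduling freedom concrete, the following minimal sketch compares two orders in which a CPU could run the work-items of a kernel that contains a loop. It is an illustration only and not the compiler analysis described in the abstract: the names `address_trace`, `unit_stride_fraction`, and `pick_order`, the two example address functions, and the "count unit-stride steps" heuristic are all assumptions made for this example.

```python
def address_trace(addr, num_items, trip_count, order):
    """Return the sequence of addresses touched under a given schedule.

    order="depth":   each work-item runs its whole loop before the next starts.
    order="breadth": all work-items run iteration j before any runs j + 1.
    """
    if order == "depth":
        return [addr(i, j) for i in range(num_items) for j in range(trip_count)]
    if order == "breadth":
        return [addr(i, j) for j in range(trip_count) for i in range(num_items)]
    raise ValueError(f"unknown order: {order}")


def unit_stride_fraction(trace):
    """Fraction of consecutive accesses that land on the next element."""
    if len(trace) < 2:
        return 1.0
    hits = sum(1 for a, b in zip(trace, trace[1:]) if b - a == 1)
    return hits / (len(trace) - 1)


def pick_order(addr, num_items, trip_count):
    """Choose the schedule whose trace has the most unit-stride steps.

    Hypothetical heuristic for illustration; a real compiler would reason
    about the address expressions symbolically rather than enumerate them.
    """
    return max(
        ("depth", "breadth"),
        key=lambda o: unit_stride_fraction(
            address_trace(addr, num_items, trip_count, o)
        ),
    )


NUM_ITEMS, TRIP_COUNT = 8, 16

# Work-item i walks its own row: consecutive loop iterations are adjacent.
row_per_item = lambda i, j: i * TRIP_COUNT + j
# Work-item i walks a column: consecutive work-items are adjacent.
column_per_item = lambda i, j: j * NUM_ITEMS + i

print(pick_order(row_per_item, NUM_ITEMS, TRIP_COUNT))     # depth
print(pick_order(column_per_item, NUM_ITEMS, TRIP_COUNT))  # breadth
```

In this toy model, neither order is best for every kernel: the row-style access pattern is contiguous only when each work-item finishes its loop before the next one starts, while the column-style pattern is contiguous only when work-items are interleaved per iteration. Both orders are legal here because the sketch assumes no barrier inside the loop.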
