Abstract: Heterogeneous systems with hardware accelerators are increasingly common, and various optimized implementations and algorithms exist for common computation kernels. However, no single combination of code version and device (C&D) outperforms all others across all inputs, demanding a method that selects the best C&D pair based on the input. We present MLCD, a machine learning-based code version and device selection method that uses input data characteristics to select the best C&D pair dynamically. We also apply active learning to reduce the number of samples needed to construct the model. Demonstrated on two different CPU-GPU systems, MLCD achieves near-optimal speed-up on both systems tested. Concretely, on the first system, which uses mid-range hardware, it achieves 99.9%, 95.6%, 99.9%, and 98.6% of the optimal acceleration attainable through the ideal choice of C&D pairs in General Matrix Multiply, PageRank, N-body Simulation, and K-Motif Counting, respectively. MLCD achieves speed-ups of 2.57$\boldsymbol{\times}$, 1.58$\boldsymbol{\times}$, 2.68$\boldsymbol{\times}$, and 1.09$\boldsymbol{\times}$ over baselines without MLCD. Additionally, MLCD handles end-to-end applications, achieving up to 10% and 46% speed-up over GPU-only and CPU-only solutions with Graph Neural Networks. Furthermore, it achieves a 7.28$\boldsymbol{\times}$ average speed-up in execution latency over the state-of-the-art approach and determines suitable code versions for unseen inputs $10^{8}-10^{10}\boldsymbol{\times}$ faster.
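The abstract does not specify MLCD's model class, feature set, or sampling policy; the following is a minimal sketch, assuming a random-forest classifier over hypothetical GEMM input features (log matrix dimensions, density) and uncertainty-based active learning as one plausible way to cut the number of profiled samples. The feature names, labels, and synthetic oracle below are illustrative assumptions, not the paper's actual design.

```python
# Sketch: input-driven code-version-and-device (C&D) selection with active
# learning. All features, labels, and the synthetic labeling rule here are
# hypothetical stand-ins for offline profiling measurements.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical unlabeled pool: each row holds input characteristics of a GEMM
# call (log2 matrix dims M, N, K, and density); the label is the fastest C&D
# pair measured offline, e.g. 0 = tiled kernel on CPU, 1 = blocked GPU kernel.
pool_X = rng.uniform(low=[4, 4, 4, 0.0], high=[14, 14, 14, 1.0], size=(2000, 4))
pool_y = (pool_X[:, :3].sum(axis=1) + 2 * pool_X[:, 3] > 18).astype(int)  # synthetic rule

# Active learning: start from a small labeled seed, then repeatedly profile
# (here: reveal) the pool points the current model is least certain about,
# so far fewer samples are needed than exhaustive benchmarking would require.
labeled = list(range(20))
model = RandomForestClassifier(n_estimators=50, random_state=0)
for _ in range(10):
    model.fit(pool_X[labeled], pool_y[labeled])
    proba = model.predict_proba(pool_X)
    uncertainty = 1.0 - proba.max(axis=1)          # low max-probability => uncertain
    uncertainty[labeled] = -1.0                    # never re-pick labeled rows
    labeled.extend(np.argsort(uncertainty)[-20:])  # label the 20 most uncertain

# At run time, the trained selector maps an unseen input to a C&D pair with
# one cheap inference call instead of trying every code version and device.
new_input = np.array([[12.0, 12.0, 12.0, 0.9]])
print("chosen C&D pair:", model.predict(new_input)[0])
```

Under these assumptions, the per-input selection cost is a single tree-ensemble inference, which is consistent with the abstract's claim of choosing code versions for unseen inputs orders of magnitude faster than approaches that search or re-profile at run time.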
DOI: 10.1109/TC.2025.3558606