Abstract: Unified Virtual Memory (UVM) is a promising feature in CPU-GPU heterogeneous systems that allows data structures to be accessed by both the CPU and GPUs through unified pointers without explicit data copying. However, the delivered performance of UVM depends critically on the efficiency of address translation. Current GPU thread block (TB) management is unaware of the translation process and heavily thrashes the per-streaming-multiprocessor (SM) private Translation Lookaside Buffers (TLBs). In this paper, we conduct a comprehensive characterization of 10 GPU benchmarks and quantify translation reuse among thread blocks. Our characterization reveals substantial translation reuse within TBs but little across them. Moreover, inter-TB interference significantly enlarges intra-TB translation reuse distances. To this end, we propose translation-aware TB scheduling and a lightweight GPU L1 TLB partitioning scheme to effectively mitigate this contention. Experimental results show that our approach improves the L1 TLB hit rate, which translates into an average execution time reduction of 12.5%.
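To make the partitioning idea concrete, the following is a minimal illustrative sketch, not the paper's implementation: a set-associative L1 TLB whose sets are statically divided among the TBs resident on an SM, so that one TB's translations cannot evict another's. All names, the TLB geometry, and the set-partitioning function are assumptions chosen for illustration.

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

// One TLB entry: a virtual page number tag plus a valid bit.
struct TlbEntry {
    uint64_t vpn = 0;
    bool valid = false;
};

// Hypothetical set-associative TLB partitioned by thread block (TB) id.
class PartitionedTlb {
public:
    // numSets/ways describe the TLB geometry; maxResidentTbs is the number
    // of concurrent TBs the sets are split across (all assumed values).
    PartitionedTlb(size_t numSets, size_t ways, size_t maxResidentTbs)
        : numSets_(numSets), tbs_(maxResidentTbs),
          sets_(numSets, std::vector<TlbEntry>(ways)) {}

    // Each TB is confined to numSets_ / tbs_ contiguous sets, so evictions
    // caused by one TB never touch another TB's partition.
    size_t setIndex(uint64_t vpn, unsigned tbId) const {
        size_t setsPerTb = numSets_ / tbs_;
        size_t base = (tbId % tbs_) * setsPerTb;
        return base + (vpn % setsPerTb);
    }

    // Returns true on a hit; on a miss, installs the translation inside
    // this TB's own partition (simple fill/evict-first policy).
    bool lookup(uint64_t vpn, unsigned tbId) {
        auto& set = sets_[setIndex(vpn, tbId)];
        for (auto& e : set)
            if (e.valid && e.vpn == vpn) return true;
        TlbEntry* victim = &set[0];
        for (auto& e : set)
            if (!e.valid) { victim = &e; break; }
        victim->vpn = vpn;
        victim->valid = true;
        return false;
    }

private:
    size_t numSets_, tbs_;
    std::vector<std::vector<TlbEntry>> sets_;
};

int main() {
    PartitionedTlb tlb(/*numSets=*/64, /*ways=*/4, /*maxResidentTbs=*/8);
    // TB 0 and TB 1 touch the same page but index disjoint sets,
    // so neither thrashes the other's entries.
    std::cout << tlb.lookup(0x1234, /*tbId=*/0) << "\n"; // 0: cold miss
    std::cout << tlb.lookup(0x1234, /*tbId=*/0) << "\n"; // 1: hit
    std::cout << tlb.lookup(0x1234, /*tbId=*/1) << "\n"; // 0: separate partition
}
```

Under this sketch, inter-TB interference is eliminated by construction, at the cost of a smaller effective TLB per TB; the paper's characterization (strong intra-TB reuse, little cross-TB reuse) is what makes that trade-off favorable.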