ProFetch: Accelerate Deep Recommendation System Training with Proactively Designed Data Layout and Dynamic Prefetching

Zhibing Liu, Biyu Zhou, Weigang Zhang, Xuehai Tang, Ruixuan Li, Songlin Hu

Published: 2024, Last Modified: 15 Jan 2026, ICONIP (5) 2024, CC BY-SA 4.0

Abstract: Deep-learning-based recommendation systems play a vital role in today’s society. Because the embedding layers of recommendation models consume massive amounts of memory, current practice is to keep them on CPUs and use GPUs to speed up training of the remaining parameters. In this hybrid training paradigm, data transfer between CPU and GPU becomes a bottleneck. Although several cache-enhanced approaches have been proposed, their somewhat passive nature limits training throughput. In this paper, we propose ProFetch, a novel proactive cache prefetching method that fully leverages the cache to accelerate training. Specifically, we observe varying degrees of overlap in the embedding parameters accessed by randomly shuffled mini-batches, and exploit this observation in a mini-batch layout strategy capable of sharply reducing CPU-GPU data transfer during prefetching. We then propose an aggressive cache prefetching strategy that adaptively determines what to prefetch at each step, maximizing the overlap of data transfer with GPU computation. We evaluate a prototype on open-source deep-learning-based recommendation models. Experimental results show that, compared with representative methods, ProFetch significantly improves the training throughput of cache-enhanced hybrid recommendation systems, achieving speedups of up to 2.06×.
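
To make the overlap observation concrete, the following is a minimal illustrative sketch and not ProFetch's actual algorithm. It assumes each mini-batch is represented by the set of embedding IDs it accesses, that the GPU cache holds exactly the previous mini-batch's embeddings, and that mini-batches may be freely reordered. The names `transfer_cost` and `greedy_layout` are hypothetical and introduced here only for illustration.

```python
from typing import List, Set


def transfer_cost(order: List[int], batches: List[Set[int]]) -> int:
    """Count embedding IDs that must be moved from CPU to GPU.

    Simplifying assumption: the cache holds exactly the IDs of the
    previously processed mini-batch, so only IDs absent from it are fetched.
    """
    cost = 0
    cached: Set[int] = set()
    for idx in order:
        cost += len(batches[idx] - cached)
        cached = batches[idx]
    return cost


def greedy_layout(batches: List[Set[int]]) -> List[int]:
    """Order mini-batches so each shares as many IDs as possible with its predecessor.

    This is a simple greedy heuristic chosen for illustration, starting
    from mini-batch 0; ties are broken by the lowest index.
    """
    if not batches:
        return []
    remaining = set(range(1, len(batches)))
    order = [0]
    while remaining:
        prev = batches[order[-1]]
        nxt = max(remaining, key=lambda i: (len(batches[i] & prev), -i))
        order.append(nxt)
        remaining.remove(nxt)
    return order


# Toy data: four mini-batches given as sets of accessed embedding IDs.
batches = [{1, 2, 3}, {7, 8, 9}, {2, 3, 4}, {8, 9, 10}]
shuffled = [0, 1, 2, 3]
laid_out = greedy_layout(batches)

print("shuffled cost:", transfer_cost(shuffled, batches))
print("laid-out order:", laid_out)
print("laid-out cost:", transfer_cost(laid_out, batches))
```

On this toy input, placing overlapping mini-batches next to each other lowers the number of embeddings transferred under the stated cache assumption. A real system would also have to account for cache capacity, eviction, and the effect of reordering on training convergence, none of which this sketch models.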