Toward 100TB Recommendation Models with Embedding Offloading

Intaik Park, Ehsan Ardestani, Damian Reeves, Sarunya Pumma, Henry Tsang, Levy Zhao, Jian He, Joshua Deng, Dennis van der Staay, Yu Guo, Paul Zhang

Published: 08 Oct 2024, Last Modified: 14 Apr 2026RecSys '24EveryoneRevisionsCC BY 4.0

Abstract: Training recommendation models become memory-bound with large embedding tables, and fast GPU memory is scarce. In this paper, we explore embedding caches and prefetch pipelines to effectively leverage large but slow host memory for embedding tables. We introduce Locality-Aware Sharding and iterative planning that automatically size caches optimally and produce effective sharding plans. Embedding Offloading, a system that combines all of these components and techniques, is implemented on top of Meta’s open-source libraries, FBGEMM GPU and TorchRec, and it is used to improve scalability and efficiency of industry-scale production models. Embedding Offloading achieved 37x model scale to 100TB model size with only 26% training speed regression.