Scheduling Data Processing Pipelines for Incremental Training on MLP-based Recommendation Models

Published: 01 Jan 2025, Last Modified: 11 Nov 2025, SIGMOD Conference Companion 2025, CC BY-SA 4.0
Abstract: Multi-layer Perceptron (MLP)-based models are widely used in modern recommendation applications. In practice, industrial recommendation scenarios frequently launch continuous incremental training jobs that run for only one epoch to capture real-time user features. Such jobs are shorter than full training runs and spend a larger proportion of their time on feature processing. To fully utilize fragmented resources, our model engineering team at Tencent uses resource-constrained CPU clusters to run these incremental training workloads. To improve their efficiency, we identify scheduling optimizations that overlap feature processing and model training at the level of data processing pipelines. In particular, we propose an intra-pipeline scheduling strategy, which dynamically prefetches feature processing operators to fill the CPU idle time during embedding-lookup communication. Furthermore, we propose an inter-pipeline scheduling strategy, which balances the resource demands of different pipelines: it prioritizes the execution of critical pipelines and overlaps their communication with the execution of non-critical pipelines. Based on these two scheduling strategies, we implement RECS, a novel incremental recommendation training framework built on top of TensorFlow. In our experimental studies, RECS achieves a speedup of 1.36x over existing solutions on industrial workloads.
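The core idea of the intra-pipeline strategy, as the abstract describes it, is to keep CPUs busy with feature processing while the embedding lookup waits on communication. A minimal, framework-agnostic sketch of that overlap is shown below using a bounded prefetch queue and a background thread; the function names (`process_features`, `embedding_lookup`, `pipelined_train`) and the queue depth are illustrative assumptions, not the RECS API:

```python
import queue
import threading
import time

def process_features(batch):
    # Stand-in for CPU-bound feature processing operators (an assumption;
    # in RECS these come from the data processing pipeline).
    time.sleep(0.005)
    return batch * 2

def embedding_lookup(batch):
    # Stand-in for the communication-bound embedding lookup, during which
    # CPUs would otherwise sit idle.
    time.sleep(0.005)
    return batch + 1

def pipelined_train(batches, depth=2):
    """Overlap feature processing with embedding-lookup communication by
    prefetching processed batches into a bounded queue (hypothetical sketch)."""
    q = queue.Queue(maxsize=depth)

    def producer():
        for b in batches:
            # Runs concurrently with the consumer's embedding lookup,
            # filling the CPU idle time that the abstract identifies.
            q.put(process_features(b))
        q.put(None)  # sentinel: no more batches

    threading.Thread(target=producer, daemon=True).start()
    results = []
    while True:
        item = q.get()
        if item is None:
            break
        results.append(embedding_lookup(item))
    return results
```

With this structure, batch i+1 is processed on the CPU while batch i's lookup is in flight, so the steady-state step time approaches max(processing, communication) rather than their sum; the bounded queue caps memory when processing outpaces lookup.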