Abstract: Executing machine learning jobs on serverless platforms can deliver higher performance in a simplified manner, but current solutions have not fully exploited the flexibility of serverless computing. The widely used Bulk Synchronous Parallel (BSP) model wastes significant resources, and the parameter server nodes suffer bottleneck pressure from both the network and computation. This paper presents Chorus, a machine learning framework for serverless platforms based on the parameter server architecture. In Chorus we propose the Lambda Synchronous Parallel (LSP) model to coordinate Lambda workers: it lets functions with different resource levels collaborate simultaneously and dynamically adjusts resource allocation to keep model training balanced. To alleviate the bottleneck pressure on the parameter server, we build a buffer system in an in-memory database to exchange gradient data between the parameter server and the workers. The buffer system provides multiple buffer slots to relieve network pressure, and a buffer-merging strategy disperses the parameter server's computational load among the Lambda workers. In experiments on different ML algorithms with different synchronous parallel models, Chorus shows outstanding performance improvements and budget-saving capacity.