Abstract: We present a model-less, privacy-preserving, low-latency inference framework that satisfies user-defined System-Level Objectives (SLOs) for Stable Diffusion as a Service (SDaaS). Developers of Stable Diffusion (SD) models register their trained models with our proposed system through a declarative API. Users, in turn, specify SLOs through the user API in terms of the style of the image generated for their input text, the requested processing latency, and the minimum requested text-to-image similarity (CLIP score). Assuming only black-box access to the registered models, we profile them on hardware accelerators to design an inference predictor module. Given a specific SD model on a hardware accelerator, this module heuristically predicts the number of inference steps required to reach the user-requested text-to-image CLIP score within the requested latency, thereby satisfying the SLO. In combination with the inference predictor module, we propose a shortest-job-first scheduling algorithm for our inference framework. Compared to traditional Deep Neural Network (DNN) and Large Language Model (LLM) inference scheduling algorithms, our method achieves a lower average job completion time and satisfies more SLOs on average in user-defined SLO scenarios.
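To make the interplay between the inference predictor and the shortest-job-first scheduler concrete, below is a minimal Python sketch under stated assumptions: `PROFILE`, `Job`, `predict`, and `schedule` are hypothetical names, and the table-lookup predictor is a simplified stand-in for the paper's profiling-based heuristic, not the authors' actual implementation.

```python
import heapq
from dataclasses import dataclass, field

# Hypothetical offline profile for one SD model on one accelerator:
# diffusion steps -> (expected CLIP score, end-to-end latency in seconds).
PROFILE = {10: (0.24, 0.5), 20: (0.28, 1.0), 30: (0.31, 1.5)}

@dataclass(order=True)
class Job:
    predicted_latency: float                    # SJF ordering key
    request_id: int = field(compare=False)
    steps: int = field(compare=False)           # predicted diffusion steps
    latency_slo: float = field(compare=False)   # user-requested latency bound

def predict(target_clip: float, latency_slo: float):
    """Stand-in for the inference predictor: choose the smallest profiled
    step count that meets the requested CLIP score without exceeding the
    requested latency; return None if the SLO looks infeasible."""
    for steps, (clip, latency) in sorted(PROFILE.items()):
        if clip >= target_clip and latency <= latency_slo:
            return steps, latency
    return None

def schedule(jobs: list[Job]) -> list[Job]:
    """Shortest-job-first: dispatch jobs in order of predicted latency."""
    heap = list(jobs)
    heapq.heapify(heap)
    return [heapq.heappop(heap) for _ in range(len(heap))]

# Example: admit two requests, then dispatch the shortest first.
reqs = [(1, 0.28, 1.2), (2, 0.24, 2.0)]        # (id, target CLIP, latency SLO)
queue = []
for rid, clip, slo in reqs:
    pred = predict(clip, slo)
    if pred is not None:                        # reject infeasible SLOs
        steps, lat = pred
        queue.append(Job(lat, rid, steps, slo))
print([j.request_id for j in schedule(queue)])  # -> [2, 1]
```

Ordering the ready queue by predicted latency rather than arrival time is the classical reason shortest-job-first minimizes average job completion time, which is the metric the abstract reports.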