Abstract: Latent Diffusion Models (LDMs) have emerged as a prominent approach within generative AI, particularly for consumer-level image generation. They make inference with Diffusion Models (DMs) efficient by operating on latent-space representations, which reduces computational requirements while preserving output quality and flexibility. Advanced sampling algorithms further improve inference speed and quality, making large-scale, low-latency image generation services feasible. However, image generation inference remains time-consuming, and no specialized scheduling system exists for large-scale image generation models that ensures both high resource utilization and latency guarantees. To address this, we introduce a two-stage method that saves intermediate samples, which allows later requests to bypass the initial sampling steps and shortens image generation time. To provide predictable, high-utilization service for large-scale image generation requests, we conduct an in-depth analysis of the LDM structure and find that the computation time of a response is highly predictable. Building on this, we propose RAPID, an online acceleration and scheduling framework designed for LDM-based network request services. RAPID reduces latency and balances load across heterogeneous GPUs through precise computation scheduling tailored to each GPU. Extensive experiments indicate that RAPID achieves a ~37% increase in inference speed in multi-GPU, high-concurrency environments.
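The scheduling idea in the abstract can be illustrated with a minimal sketch. It is not RAPID's actual algorithm, which the abstract does not specify. It assumes that latency on a given GPU is simply the number of remaining denoising steps times a measured per-step time, and that each request goes to the GPU with the earliest predicted finish time. The names `Gpu`, `sec_per_step`, `predicted_latency`, `assign`, and `skipped_steps` are hypothetical, introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Gpu:
    name: str
    sec_per_step: float       # assumed: measured seconds per denoising step on this GPU
    busy_until: float = 0.0   # predicted time at which already-queued work finishes


def predicted_latency(gpu: Gpu, steps: int, skipped_steps: int = 0) -> float:
    """Predicted compute time, assuming cost is linear in the remaining steps.

    skipped_steps models initial sampling steps bypassed by reusing a saved
    intermediate sample (an assumption about how the two-stage method is used).
    """
    return (steps - skipped_steps) * gpu.sec_per_step


def assign(gpus: list[Gpu], now: float, steps: int, skipped_steps: int = 0):
    """Greedy placement: pick the GPU with the earliest predicted finish time."""
    def finish(g: Gpu) -> float:
        return max(g.busy_until, now) + predicted_latency(g, steps, skipped_steps)

    best = min(gpus, key=finish)
    best.busy_until = finish(best)  # reserve the GPU until the predicted finish
    return best.name, best.busy_until


if __name__ == "__main__":
    pool = [Gpu("fast", 0.05), Gpu("slow", 0.08)]
    for _ in range(3):
        print(assign(pool, now=0.0, steps=50))
```

Because the per-step cost is treated as a constant for each GPU, the scheduler can estimate finish times before running anything, which is what makes a latency-aware placement on heterogeneous hardware possible in this simplified picture.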
External IDs: dblp:journals/tccn/PingMC25