Keywords: Multimodal large language models; Inference optimizations; Infrastructure
Abstract: Recent multimodal large language models (MLLMs) marry modality-specific
vision or audio encoders with a shared text decoder. The encoder is compute-intensive
but memory-light, while the decoder is the opposite; yet state-of-the-art serving
stacks still time-multiplex these complementary kernels, idling SMs or HBM in
turn. We introduce SpaceServe, a serving system that space-multiplexes MLLMs:
it decouples all modality encoders from the decoder and co-locates them on the
same GPU using the fine-grained SM partitioning available in modern runtimes. A
cost-model-guided Space-Inference Scheduler (SIS) dynamically assigns SM slices,
while a Time-Windowed Shortest-Remaining-First (TWSRFT) policy batches encoder
requests to minimise completion latency and smooth decoder arrivals.
Evaluation shows that SpaceServe reduces time-per-output-token by 4.81×
on average and by up to 28.9× on NVIDIA A100 GPUs. SpaceServe is available at
https://github.com/gofreelee/SpaceServe.
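The abstract's space-multiplexing idea rests on giving the encoder and decoder complementary SM slices on the same GPU. The sketch below illustrates one common way such a slice could be enforced, via CUDA MPS's per-client `CUDA_MPS_ACTIVE_THREAD_PERCENTAGE` knob; the worker scripts (`encoder_worker.py`, `decoder_worker.py`) and the 60/40 split are hypothetical placeholders, not SpaceServe's cost-model-derived assignment or actual mechanism.

```python
import os
import subprocess


def launch_with_sm_slice(cmd: list[str], sm_percent: int) -> subprocess.Popen:
    """Launch a GPU worker restricted to a fraction of the device's SMs.

    Uses CUDA MPS's per-client CUDA_MPS_ACTIVE_THREAD_PERCENTAGE variable,
    one way to realize fine-grained SM partitioning on NVIDIA GPUs; the MPS
    control daemon must already be running on the node.
    """
    env = dict(os.environ, CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=str(sm_percent))
    return subprocess.Popen(cmd, env=env)


if __name__ == "__main__":
    # Illustrative split: co-locate a compute-heavy encoder and a
    # memory-bound decoder on one GPU with complementary SM slices.
    encoder = launch_with_sm_slice(["python", "encoder_worker.py"], sm_percent=60)
    decoder = launch_with_sm_slice(["python", "decoder_worker.py"], sm_percent=40)
    encoder.wait()
    decoder.wait()
```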
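The TWSRFT policy named in the abstract batches encoder requests within a time window and releases them shortest-remaining-first. The following is a minimal sketch reconstructed from that description alone; the class and field names are assumptions for illustration, not SpaceServe's API.

```python
import heapq
import time
from dataclasses import dataclass, field


@dataclass(order=True)
class EncoderRequest:
    # Hypothetical request record; remaining_work stands in for an
    # estimated encoder cost (e.g., remaining runtime or FLOPs).
    remaining_work: float
    arrival: float = field(compare=False, default=0.0)
    payload: object = field(compare=False, default=None)


class TWSRFTScheduler:
    """Sketch of a time-windowed shortest-remaining-first batcher.

    Requests arriving inside a fixed window are pooled, then drained in
    order of shortest remaining work, so short encoder jobs finish first
    and decoder arrivals are smoothed.
    """

    def __init__(self, window_s: float = 0.01):
        self.window_s = window_s
        self._pool: list[EncoderRequest] = []

    def submit(self, req: EncoderRequest) -> None:
        # Heap keeps the pool ordered by remaining work.
        heapq.heappush(self._pool, req)

    def next_batch(self) -> list[EncoderRequest]:
        # Wait out one window, then drain shortest-remaining-first.
        time.sleep(self.window_s)
        batch = []
        while self._pool:
            batch.append(heapq.heappop(self._pool))
        return batch


if __name__ == "__main__":
    sched = TWSRFTScheduler(window_s=0.005)
    for work in (3.0, 1.0, 2.0):
        sched.submit(EncoderRequest(remaining_work=work, arrival=time.time()))
    print([r.remaining_work for r in sched.next_batch()])  # -> [1.0, 2.0, 3.0]
```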
Primary Area: Infrastructure (e.g., libraries, improved implementation and scalability, distributed solutions)
Submission Number: 11944