Keywords: Multimodal large language models; Inference optimizations; Infrastructure
Abstract: Recent multimodal large language models (MLLMs) marry modality-specific
vision or audio encoders with a shared text decoder. The encoder is compute-intensive
but memory-light, while the decoder is the opposite; yet state-of-the-art serving
stacks still time-multiplex these complementary kernels, idling SMs or HBM in
turn. We introduce SpaceServe, a serving system that space-multiplexes MLLMs:
it decouples all modality encoders from the decoder and co-locates them on the
same GPU using the fine-grained SM partitioning available in modern runtimes. A
cost-model-guided Space-Inference Scheduler (SIS) dynamically assigns SM slices,
while a Time-Windowed Shortest-Remaining-First (TWSRFT) policy batches encoder
requests to minimise completion latency and smooth decoder arrivals.
Evaluation shows that SpaceServe reduces time-per-output-token by 4.81×
on average and by up to 28.9× on NVIDIA A100 GPUs. SpaceServe is available at
https://github.com/gofreelee/SpaceServe.
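The abstract's space-multiplexing idea rests on giving the encoder and decoder complementary SM slices on the same GPU. The sketch below illustrates one common way such a slice could be enforced, via CUDA MPS's per-client `CUDA_MPS_ACTIVE_THREAD_PERCENTAGE` knob; the worker scripts (`encoder_worker.py`, `decoder_worker.py`) and the 60/40 split are hypothetical placeholders, not SpaceServe's cost-model-derived assignment or actual mechanism.

```python
import os
import subprocess


def launch_with_sm_slice(cmd: list[str], sm_percent: int) -> subprocess.Popen:
    """Launch a GPU worker restricted to a fraction of the device's SMs.

    Uses CUDA MPS's per-client CUDA_MPS_ACTIVE_THREAD_PERCENTAGE variable,
    one way to realize fine-grained SM partitioning on NVIDIA GPUs; the MPS
    control daemon must already be running on the node.
    """
    env = dict(os.environ, CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=str(sm_percent))
    return subprocess.Popen(cmd, env=env)


if __name__ == "__main__":
    # Illustrative split: co-locate a compute-heavy encoder and a
    # memory-bound decoder on one GPU with complementary SM slices.
    encoder = launch_with_sm_slice(["python", "encoder_worker.py"], sm_percent=60)
    decoder = launch_with_sm_slice(["python", "decoder_worker.py"], sm_percent=40)
    encoder.wait()
    decoder.wait()
```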
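The TWSRFT policy named in the abstract batches encoder requests within a time window and releases them shortest-remaining-first. The following is a minimal sketch reconstructed from that description alone; the class and field names are assumptions for illustration, not SpaceServe's API.

```python
import heapq
import time
from dataclasses import dataclass, field


@dataclass(order=True)
class EncoderRequest:
    # Hypothetical request record; remaining_work stands in for an
    # estimated encoder cost (e.g., remaining runtime or FLOPs).
    remaining_work: float
    arrival: float = field(compare=False, default=0.0)
    payload: object = field(compare=False, default=None)


class TWSRFTScheduler:
    """Sketch of a time-windowed shortest-remaining-first batcher.

    Requests arriving inside a fixed window are pooled, then drained in
    order of shortest remaining work, so short encoder jobs finish first
    and decoder arrivals are smoothed.
    """

    def __init__(self, window_s: float = 0.01):
        self.window_s = window_s
        self._pool: list[EncoderRequest] = []

    def submit(self, req: EncoderRequest) -> None:
        # Heap keeps the pool ordered by remaining work.
        heapq.heappush(self._pool, req)

    def next_batch(self) -> list[EncoderRequest]:
        # Wait out one window, then drain shortest-remaining-first.
        time.sleep(self.window_s)
        batch = []
        while self._pool:
            batch.append(heapq.heappop(self._pool))
        return batch


if __name__ == "__main__":
    sched = TWSRFTScheduler(window_s=0.005)
    for work in (3.0, 1.0, 2.0):
        sched.submit(EncoderRequest(remaining_work=work, arrival=time.time()))
    print([r.remaining_work for r in sched.next_batch()])  # -> [1.0, 2.0, 3.0]
```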
Primary Area: Infrastructure (e.g., libraries, improved implementation and scalability, distributed solutions)
Submission Number: 11944