Keywords: inference, serving
TL;DR: We propose ORTHRUS, a system that co-serves embedding and generation requests via heterogeneous batching for higher throughput.
Abstract: Modern information retrieval increasingly relies on both embedding and generative models to achieve high accuracy. To make such applications more responsive, the underlying serving systems must be optimized for mixed workloads. Yet, current systems suffer from low throughput and poor GPU utilization, primarily because they cannot batch embedding and generation requests together. We address this bottleneck with heterogeneous batching, which schedules embedding and generation requests within the same batch. Realizing this idea requires two changes to the system internals: a unified kernel abstraction and fine-grained intra-batch scheduling. The unified abstraction enables concurrent handling of embedding and generation, while the intra-batch scheduler dynamically adapts batch composition to balance end-to-end throughput across both tasks. Our evaluation with four A100 GPUs shows that heterogeneous batching achieves 1.28$\times$-4.52$\times$ higher throughput and 35.8-52.0% lower latency than default vLLM.
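To make the batching idea concrete, below is a minimal Python sketch of one way an intra-batch scheduler could compose a mixed batch under a token budget. The `Request` type, `schedule_batch` function, and `embed_ratio` knob are illustrative assumptions for exposition, not ORTHRUS's actual API.

```python
# Minimal sketch of heterogeneous batching: mixing embedding and
# generation requests in one batch under a shared token budget.
# All names here are hypothetical, not the authors' implementation.
from dataclasses import dataclass
from collections import deque

@dataclass
class Request:
    kind: str     # "embed" or "generate"
    tokens: int   # token budget this request consumes in the batch

def schedule_batch(embed_q: deque, gen_q: deque,
                   token_budget: int, embed_ratio: float) -> list:
    """Build one mixed batch. `embed_ratio` is the fraction of the
    token budget reserved for embedding requests; an intra-batch
    scheduler would adapt it based on observed per-task throughput."""
    batch, used = [], 0
    embed_budget = int(token_budget * embed_ratio)
    # Fill the embedding share first.
    while embed_q and used + embed_q[0].tokens <= embed_budget:
        req = embed_q.popleft()
        batch.append(req)
        used += req.tokens
    # Then fill the rest of the budget with generation requests.
    while gen_q and used + gen_q[0].tokens <= token_budget:
        req = gen_q.popleft()
        batch.append(req)
        used += req.tokens
    # Backfill leftover budget with embeddings if generation ran dry,
    # so the batch (and the GPU) stays full.
    while embed_q and used + embed_q[0].tokens <= token_budget:
        req = embed_q.popleft()
        batch.append(req)
        used += req.tokens
    return batch
```

The key design point the sketch illustrates is that batch composition is a tunable quantity rather than a fixed per-task queue: adjusting `embed_ratio` at runtime trades embedding throughput against generation throughput within a single batch.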
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Submission Number: 12450