Keywords: inference, serving
TL;DR: We propose ORTHRUS, a system that co-serves embedding and generation requests via heterogeneous batching for higher throughput.
Abstract: Modern information retrieval increasingly relies on both embedding and generative models to achieve high accuracy. To make such applications more responsive, the underlying serving systems must be optimized for mixed workloads. Yet, current systems suffer from low throughput and poor GPU utilization, primarily because they cannot batch embedding and generation requests together. We address this bottleneck with heterogeneous batching, which schedules embedding and generation requests within the same batch. Realizing this idea requires two changes to the system internals: a unified kernel abstraction and fine-grained intra-batch scheduling. The unified abstraction enables concurrent handling of embedding and generation, while the intra-batch scheduler dynamically adapts batch composition to balance end-to-end throughput across both tasks. Our evaluation with four A100 GPUs shows that heterogeneous batching achieves 1.28$\times$-4.52$\times$ higher throughput and 35.8-52.0% lower latency than default vLLM.
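To make the batching idea concrete, below is a minimal Python sketch of one way an intra-batch scheduler could compose a mixed batch under a token budget. The `Request` type, `schedule_batch` function, and `embed_ratio` knob are illustrative assumptions for exposition, not ORTHRUS's actual API.

```python
# Minimal sketch of heterogeneous batching: mixing embedding and
# generation requests in one batch under a shared token budget.
# All names here are hypothetical, not the authors' implementation.
from dataclasses import dataclass
from collections import deque

@dataclass
class Request:
    kind: str     # "embed" or "generate"
    tokens: int   # token budget this request consumes in the batch

def schedule_batch(embed_q: deque, gen_q: deque,
                   token_budget: int, embed_ratio: float) -> list:
    """Build one mixed batch. `embed_ratio` is the fraction of the
    token budget reserved for embedding requests; an intra-batch
    scheduler would adapt it based on observed per-task throughput."""
    batch, used = [], 0
    embed_budget = int(token_budget * embed_ratio)
    # Fill the embedding share first.
    while embed_q and used + embed_q[0].tokens <= embed_budget:
        req = embed_q.popleft()
        batch.append(req)
        used += req.tokens
    # Then fill the rest of the budget with generation requests.
    while gen_q and used + gen_q[0].tokens <= token_budget:
        req = gen_q.popleft()
        batch.append(req)
        used += req.tokens
    # Backfill leftover budget with embeddings if generation ran dry,
    # so the batch (and the GPU) stays full.
    while embed_q and used + embed_q[0].tokens <= token_budget:
        req = embed_q.popleft()
        batch.append(req)
        used += req.tokens
    return batch
```

The key design point the sketch illustrates is that batch composition is a tunable quantity rather than a fixed per-task queue: adjusting `embed_ratio` at runtime trades embedding throughput against generation throughput within a single batch.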
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Submission Number: 12450