Keywords: Inference, efficiency
Abstract: Foundation-model inference is usually served through a single chat-oriented runtime even when requests have very different resource lifetimes. We present **Prelude**, a resource-adaptive serving framework for decision-style LLM inference: judges, reward models, safety classifiers, routers, rerankers, embedding models, and prompt-logprob extractors that read a prompt and return a fixed-size artifact or at most one token.
Prelude classifies work into *OneShot*, *Mixed*, and *Decode* execution classes, avoiding per-request paged-decode state for fixed-output work while preserving the standard paged-KV path for open-ended generation. It also performs prefix-aware OneShot planning and uses an inference-only tokenizer, *fasttoken*, to remove CPU-side overhead from the same hot path.
On H200, Prelude reaches $16{,}311$ input tok/s on a Qwen3-0.6B prefill-only benchmark, $2.08\times$ the throughput of vLLM and $3.85\times$ that of SGLang. It also reaches $186.7$ req/s on Qwen3-4B at concurrency 96. In a multi-token decode control, the advantage narrows to $1.03\times$ vLLM, indicating that the gains come from execution-class adaptation rather than a uniformly faster forward kernel.
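
To make the execution-class idea concrete, the following is a minimal sketch of how a scheduler might sort a batch into *OneShot*, *Mixed*, and *Decode*. It is an illustration under stated assumptions, not Prelude's implementation: the `Request` fields, the `max_new_tokens <= 1` test for fixed-output work, and the reading of *Mixed* as "a batch containing both kinds of request" are all hypothetical, since the abstract does not define them.

```python
from dataclasses import dataclass
from enum import Enum


class ExecClass(Enum):
    ONESHOT = "OneShot"
    MIXED = "Mixed"
    DECODE = "Decode"


@dataclass(frozen=True)
class Request:
    # Hypothetical request shape; field names are assumptions for this sketch.
    prompt_tokens: int
    max_new_tokens: int  # 0 = fixed-size artifact (score, embedding, logprobs)


def is_fixed_output(req: Request) -> bool:
    # Assumed test for "a fixed-size artifact or at most one token".
    return req.max_new_tokens <= 1


def classify_batch(batch: list[Request]) -> ExecClass:
    # Assumed rule: all fixed-output -> OneShot, all open-ended -> Decode,
    # anything in between -> Mixed.
    if not batch:
        raise ValueError("cannot classify an empty batch")
    fixed = [is_fixed_output(r) for r in batch]
    if all(fixed):
        return ExecClass.ONESHOT
    if not any(fixed):
        return ExecClass.DECODE
    return ExecClass.MIXED


def needs_paged_kv(cls: ExecClass) -> bool:
    # Only a pure OneShot batch can skip per-request paged-decode state.
    return cls is not ExecClass.ONESHOT


if __name__ == "__main__":
    judge = Request(prompt_tokens=512, max_new_tokens=0)
    chat = Request(prompt_tokens=128, max_new_tokens=256)
    for batch in ([judge, judge], [judge, chat], [chat]):
        cls = classify_batch(batch)
        print(cls.value, "paged KV:", needs_paged_kv(cls))
```

Under these assumptions, only a batch made up entirely of fixed-output requests skips paged-decode state; any batch containing an open-ended request keeps the standard paged-KV path.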
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 72