Efficient Multi-Model Orchestration for Self-Hosted Large Language Models

Published: 11 Nov 2025, Last Modified: 16 Jan 2026
DAI Oral, CC BY 4.0
Keywords: Large Language Models (LLMs), Multi-Model Orchestration, Adaptive Routing, Serverless Inference, Self-Hosted AI Systems
TLDR: A lightweight Kubernetes-based framework that dynamically selects and scales large language models to balance accuracy, latency, and cost in self-hosted deployments.
Abstract: Self-hosting large language models (LLMs) is increasingly appealing for organizations seeking privacy, cost control, and customization. Yet deploying and maintaining in-house models poses challenges in GPU utilization, workload routing, and reliability. We introduce Pick and Spin, a practical framework that makes self-hosted LLM orchestration scalable and economical. Built on Kubernetes, it integrates a unified Helm-based deployment system, adaptive scale-to-zero automation, and a hybrid routing module that balances cost, latency, and accuracy using both keyword heuristics and a lightweight DistilBERT classifier. We evaluate four models, Llama 3 (90B), Gemma 3 (27B), Qwen 3 (235B), and DeepSeek R1 (685B), across eight public benchmark datasets with five inference strategies and two routing variants, encompassing 3,200 prompts and 160,000 inference runs. Pick and Spin achieves up to 10% higher accuracy, 30% lower latency, and 33% lower GPU cost per query compared with static deployments. These results show that intelligent orchestration and efficient scaling enable enterprise-grade LLM performance on self-hosted infrastructure, bringing high-capacity AI within practical and affordable reach.
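To make the hybrid routing idea concrete, below is a minimal sketch of a two-stage router: cheap keyword heuristics decide first, and only ambiguous prompts fall through to a DistilBERT text classifier whose labels map to the four evaluated models. The checkpoint name, keyword table, label-to-model mapping, and fallback choice are all illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch of a two-stage (heuristic -> classifier) prompt router.
# Assumptions: the keyword table, the label-to-model mapping, and the
# checkpoint are hypothetical; the paper's router is not public here.
from transformers import pipeline

# Stage 1: hypothetical keyword heuristics that route obviously
# specialized prompts without invoking the classifier at all.
KEYWORD_ROUTES = {
    "prove": "DeepSeek-R1-685B",   # heavy reasoning request
    "translate": "Gemma-3-27B",    # lightweight task
}

# Stage 2: a lightweight DistilBERT classifier. A real deployment would
# load a checkpoint fine-tuned to predict the best target model; the
# base checkpoint here is a runnable placeholder only.
classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased",  # placeholder, assume fine-tuned
)

# Hypothetical mapping from classifier labels to the four self-hosted
# models evaluated in the paper.
LABEL_TO_MODEL = {
    "LABEL_0": "Llama-3-90B",
    "LABEL_1": "Gemma-3-27B",
    "LABEL_2": "Qwen-3-235B",
    "LABEL_3": "DeepSeek-R1-685B",
}

def route(prompt: str) -> str:
    """Return the name of the model that should serve this prompt."""
    lowered = prompt.lower()
    for keyword, model in KEYWORD_ROUTES.items():
        if keyword in lowered:
            return model  # heuristic hit: skip the classifier entirely
    # Ambiguous prompt: defer to the classifier's top prediction,
    # falling back to the cheapest model for unknown labels.
    prediction = classifier(prompt, truncation=True)[0]
    return LABEL_TO_MODEL.get(prediction["label"], "Gemma-3-27B")

print(route("Translate this sentence into French."))   # heuristic path
print(route("Summarize the quarterly sales report."))  # classifier path
```

Keeping the heuristic stage in front of the classifier is what keeps routing overhead low: most prompts never touch the model at all, and the DistilBERT pass adds only a few milliseconds for the rest.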
Submission Number: 43