Keywords: reinforcement learning, efficient large language models, query routing
TL;DR: An RL agent is trained to route LLM requests across models in a latency-sensitive manner.
Abstract: Many applications must provide low-latency LLM service to users or risk an unacceptable user experience. However, over-provisioning resources to serve fluctuating request patterns is often prohibitively expensive. In this work, we present a best-effort serving system that employs deep reinforcement learning to adjust service quality based on the task distribution and system load. Compared with static serving on unpredictable workloads, our best-effort system maintains availability at client request rates over 10× higher, serves above 96% of peak performance 4.1× more often, and serves above 98% of peak performance 2.3× more often.
Submission Number: 13