Abstract: Modern service systems are increasingly adopting innovative modalities enabled by emerging technologies,
such as AI-assisted services, to achieve a better balance between quality and efficiency. Motivated by these
advancements, we study the optimal dynamic control of service modes in systems with unknown parameters,
aiming to balance reward accumulation and system congestion. We consider a single-server queueing system
with two switchable service modes, each characterized by a distinct service rate and an unknown distribution
of rewards earned upon service completion. The objective is to maximize the long-run average of cumulative
rewards minus holding costs. To address this problem, we first characterize the optimal state-dependent
policy under full knowledge of model parameters. For the case of unknown reward distributions, we propose
an online learning algorithm based on Upper Confidence Bound estimates to adaptively learn the optimal
policy. Our algorithm achieves a statistically near-optimal regret bound of O(√T) over a time horizon T and
demonstrates strong numerical performance. A key methodological contribution is a novel regret decomposition and a regenerative cycle-based framework, offering broader insights into learning-based optimal control
in queueing systems. Finally, we demonstrate the practical relevance of our approach through a case study
on optimizing AI-assisted patient message replies in a healthcare setting.
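As an illustrative sketch (the notation below is assumed for exposition and is not taken verbatim from the paper), the objective described in the abstract can be written as

\[
\max_{\pi}\ \liminf_{T \to \infty} \frac{1}{T}\, \mathbb{E}^{\pi}\!\left[\, \sum_{i=1}^{N(T)} R_i \;-\; h \int_{0}^{T} Q(t)\, \mathrm{d}t \,\right],
\]

where \(\pi\) is a state-dependent mode-switching policy, \(N(T)\) is the number of service completions by time \(T\), \(R_i\) is the (random) reward earned at the \(i\)-th completion, \(Q(t)\) is the queue length at time \(t\), and \(h\) is the per-unit-time holding cost.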
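A minimal, hedged sketch of the Upper Confidence Bound ingredient mentioned above, assuming a simple per-mode reward estimator with a standard confidence bonus; the class and variable names are illustrative, and the sketch deliberately omits the queueing state, service rates, and holding costs that the paper's state-dependent policy accounts for.

# Minimal sketch (not the paper's algorithm): UCB-style per-mode reward
# estimates of the kind that could drive a mode-switching rule.
# All names (UCBRewardEstimator, true_means) are illustrative assumptions.
import math
import random


class UCBRewardEstimator:
    """Tracks empirical mean rewards per service mode with a UCB bonus."""

    def __init__(self, num_modes: int):
        self.counts = [0] * num_modes
        self.sums = [0.0] * num_modes

    def update(self, mode: int, reward: float) -> None:
        # Record the reward observed at a service completion under `mode`.
        self.counts[mode] += 1
        self.sums[mode] += reward

    def ucb(self, mode: int, t: int) -> float:
        # Optimistic estimate: empirical mean plus a confidence radius
        # shrinking at the usual sqrt(log t / n) rate.
        if self.counts[mode] == 0:
            return float("inf")  # force initial exploration of each mode
        mean = self.sums[mode] / self.counts[mode]
        bonus = math.sqrt(2.0 * math.log(max(t, 2)) / self.counts[mode])
        return mean + bonus


if __name__ == "__main__":
    # Toy usage: two modes with unknown Bernoulli rewards; pick the mode
    # with the larger UCB index at each (discretized) decision epoch.
    true_means = [0.4, 0.7]
    est = UCBRewardEstimator(num_modes=2)
    for t in range(1, 1001):
        mode = max(range(2), key=lambda m: est.ucb(m, t))
        reward = float(random.random() < true_means[mode])
        est.update(mode, reward)
    print("completions per mode:", est.counts)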