Abstract: Large-scale cloud providers rely on cluster managers for
container allocation and load balancing (e.g., Kubernetes),
VM provisioning (e.g., Protean), and other management tasks.
These cluster managers use algorithms or heuristics whose
behavior depends upon multiple configuration parameters.
Currently, operators manually set these parameters using a
combination of domain knowledge and limited testing. In very
large-scale and dynamic environments, these manually set
parameters may lead to sub-optimal cluster states, adversely
affecting important metrics such as latency and throughput.
In this paper, we describe SelfTune, a framework that
automatically tunes such parameters in deployment. SelfTune
piggybacks on the iterative nature of cluster managers, which
drive a cluster to a desired state over multiple iterations.
Developers integrate SelfTune into the cluster manager code
through a simple interface; SelfTune then uses a principled
reinforcement learning algorithm to tune important parameters
over time. We have deployed SelfTune on tens of thousands
of machines that run a large-scale background task
scheduler at Microsoft. SelfTune has improved throughput by as
much as 20% in this deployment by continuously tuning a
key configuration parameter that determines the number of
jobs concurrently accessing CPU and disk on every machine.
We also evaluate SelfTune with two Azure FaaS workloads,
the Kubernetes Vertical Pod Autoscaler, and the DeathStar
microservice benchmark. In all cases, SelfTune significantly
improves cluster performance.
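
To make the integration pattern described above concrete, the following is a minimal sketch, not SelfTune's actual interface or algorithm. The names ToyTuner, predict, set_reward, and toy_throughput are hypothetical, and the update rule is a plain two-point finite-difference hill climber standing in for the reinforcement learning algorithm. The sketch only illustrates how a tuner can piggyback on a manager's iterations: each iteration asks for a parameter value, runs with it, and reports the observed metric back.

```python
class ToyTuner:
    """Hypothetical tuner for one numeric parameter (not SelfTune's API).

    It probes center + delta and center - delta on alternating
    iterations, then moves the center along the estimated slope.
    """

    def __init__(self, initial, low, high, delta=1.0, step=0.25):
        self.center = float(initial)
        self.low = low
        self.high = high
        self.delta = delta
        self.step = step
        self._sign = 1             # direction of the next probe
        self._plus_reward = None   # reward seen at center + delta

    def _clip(self, value):
        return min(self.high, max(self.low, value))

    def predict(self):
        """Return the parameter value to use in the next iteration."""
        return self._clip(self.center + self._sign * self.delta)

    def set_reward(self, reward):
        """Report the metric observed with the last predicted value."""
        if self._sign == 1:
            # First half of a probe pair: remember the reward and flip.
            self._plus_reward = reward
            self._sign = -1
            return
        # Second half: central-difference slope estimate, then ascend.
        grad = (self._plus_reward - reward) / (2 * self.delta)
        self.center = self._clip(self.center + self.step * grad)
        self._sign = 1


def toy_throughput(concurrency):
    """Made-up reward curve that peaks at a concurrency of 8."""
    return 64.0 - (concurrency - 8.0) ** 2


def run(iterations=40):
    """Simulate a manager loop that consults the tuner every iteration."""
    tuner = ToyTuner(initial=2, low=1, high=16)
    for _ in range(iterations):
        concurrency = tuner.predict()          # value for this iteration
        reward = toy_throughput(concurrency)   # stand-in for a measured metric
        tuner.set_reward(reward)               # feedback closes the loop
    return tuner.center
```

In this toy setting the center moves from 2 toward the peak at 8, halving its distance after every probe pair. A real deployment would face noisy rewards, integer-valued parameters, and a shifting optimum, which this sketch does not address.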