Keywords: multi-armed bandits, wireless communications, resource allocation
Abstract: Data-driven solutions to resource allocation in wireless communications are becoming increasingly pervasive, complementing legacy model-based solutions. This paper studies resource allocation when the transmitter is oblivious to both the channel models and the instantaneous channel realizations. Decision-making over a finite number of choices is modeled by multi-armed bandits (MABs), which effectively balance learning the channels (exploration) against using them in the meantime (exploitation). Despite this natural fit, some key metrics of interest (e.g., outage probability) cannot be directly specified by the average-based reward functions that MAB algorithms rely on. This paper adopts a broader notion of reward that subsumes the conventional average-based reward and accommodates other choices that can precisely specify the desired communication metrics. This leads to different principles for designing bandit algorithms. The framework is presented in a general form, and its specific applications to optimizing outage and latency are investigated.
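As background for the exploration/exploitation tradeoff the abstract refers to, the sketch below runs the standard UCB1 bandit algorithm over a finite set of channels with hypothetical Bernoulli success probabilities. It illustrates only the conventional average-based reward that the paper generalizes; it is not the paper's proposed algorithm, and the channel statistics are illustrative assumptions.

```python
import math
import random

def ucb1_channel_selection(success_prob, horizon, seed=0):
    """Standard UCB1 over a finite set of channels (arms).

    success_prob: hypothetical per-channel success probabilities;
    a Bernoulli reward stands in for a successful transmission.
    Returns per-channel pull counts and the total reward.
    """
    rng = random.Random(seed)
    k = len(success_prob)
    counts = [0] * k      # times each channel was used
    means = [0.0] * k     # empirical mean reward per channel
    total_reward = 0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1   # try each channel once first
        else:
            # choose the channel maximizing empirical mean + exploration bonus
            arm = max(range(k),
                      key=lambda a: means[a] + math.sqrt(2 * math.log(t) / counts[a]))
        reward = 1 if rng.random() < success_prob[arm] else 0
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # running average
        total_reward += reward
    return counts, total_reward

counts, total = ucb1_channel_selection([0.3, 0.5, 0.8], horizon=5000)
```

The running-average update here is exactly the average-based reward tracking that the paper argues cannot capture metrics such as outage probability, motivating its broader reward formulation.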
Submission Number: 20