Distributed Multi-Task Learning for Stochastic Bandits With Context Distribution and Stage-Wise Constraints
Abstract: We present conservative multi-task learning in stochastic linear contextual bandits with heterogeneous agents. This extends conservative linear bandits to a distributed setting where $M$ agents tackle different but related tasks while adhering to stage-wise performance constraints. The exact context is unknown, and only a context distribution is available to the agents, as in many practical applications that rely on a prediction mechanism to infer context, such as stock market prediction and weather forecasting. We propose a distributed upper confidence bound (UCB) algorithm, DiSC-UCB. Our algorithm dynamically constructs a pruned action set for each task in every round, guaranteeing compliance with the constraints. Additionally, it includes synchronized sharing of estimates among agents via a central server using well-structured synchronization steps. For $d$-dimensional linear bandits, we prove an $\tilde{O}(d\sqrt{MT})$ regret bound and an $O(M^{1.5} d^3)$ communication bound for the algorithm. We extend the problem to a setting where the agents are unaware of the baseline reward. We provide a modified algorithm, DiSC-UCB-UB, and show that it achieves the same regret and communication bounds. We empirically validate the performance of our algorithm on synthetic data and on the real-world MovieLens-100K and LastFM datasets, and compare it with existing benchmark algorithms.
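To make the idea of a pruned action set concrete, the following is a minimal single-agent sketch and not the paper's DiSC-UCB procedure. It assumes each action is represented by its expected feature vector under the context distribution, that the stage-wise constraint requires a pessimistic reward estimate to be at least $(1-\alpha)$ times a known baseline reward, and that the agent falls back to the baseline action when no action passes the check. The names `pruned_ucb_choice`, `beta`, `alpha`, and `baseline_index` are hypothetical and chosen for illustration only.

```python
import math


def dot(x, y):
    """Inner product of two equal-length vectors given as lists."""
    return sum(a * b for a, b in zip(x, y))


def quad_form(x, v_inv):
    """Return x^T V^{-1} x for a vector x and a square matrix v_inv (nested lists)."""
    n = len(x)
    return sum(x[i] * v_inv[i][j] * x[j] for i in range(n) for j in range(n))


def pruned_ucb_choice(features, theta_hat, v_inv, beta, baseline_reward,
                      alpha, baseline_index):
    """Pick the highest-UCB action among those that pass the safety check.

    features        : list of expected feature vectors, one per action (assumed)
    theta_hat       : current estimate of the reward parameter
    v_inv           : inverse of the regularized Gram matrix
    beta            : confidence-radius multiplier (assumed given)
    baseline_reward : known expected reward of the baseline action (assumed)
    alpha           : allowed fractional loss relative to the baseline
    baseline_index  : action played when the pruned set is empty (simplification)
    """
    best_index, best_ucb = baseline_index, -math.inf
    for k, x in enumerate(features):
        mean = dot(x, theta_hat)
        width = beta * math.sqrt(quad_form(x, v_inv))
        # Pruning step: keep the action only if its lower confidence bound
        # meets the stage-wise constraint.
        if mean - width >= (1.0 - alpha) * baseline_reward:
            # Among the surviving actions, act optimistically.
            if mean + width > best_ucb:
                best_index, best_ucb = k, mean + width
    return best_index
```

In this sketch, tight confidence widths let the optimistic choice through, while wide widths empty the pruned set and trigger the baseline fallback. The distributed part of the abstract (agents sharing estimates through a central server at synchronization steps) is deliberately left out.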