Learning the Pareto Front Using Bootstrapped Observation Samples
TL;DR: A novel algorithm that achieves near-optimal sample complexity and regret bounds for Pareto front identification
Abstract: We consider Pareto front identification~(PFI) for linear bandits (PFILin), where the goal is to identify the set of arms with undominated
mean reward vectors when each arm's mean reward vector is a linear function of its context.
PFILin includes the best arm identification problem and multi-objective active learning as special cases.
The sample complexity of our proposed algorithm is optimal up to a logarithmic factor.
In addition, the regret incurred by our algorithm during estimation is within a logarithmic factor of the optimal regret among all algorithms that identify the Pareto front.
Our key contribution is a new estimator that, in every round, updates the estimate of the unknown parameter along \emph{multiple} context directions, in contrast to the conventional estimator that updates the parameter estimate only along the chosen context.
This allows us to use low-regret arms to collect information about Pareto optimal arms.
The key innovation is that exploration samples are reused multiple times, whereas conventional estimators use each sample only once.
Numerical experiments demonstrate that the proposed algorithm successfully identifies the Pareto front while controlling the regret.
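For readers unfamiliar with the setting, the following is a minimal sketch of a PFILin-style problem: vector-valued rewards with linearly parameterized means, a ridge-regression estimate of the unknown parameter, and a Pareto-front computation over the estimated mean vectors. The uniform sampling rule and the plain ridge estimator are illustrative assumptions only; they are not the paper's multi-direction, sample-reusing estimator or its sampling strategy.

```python
import numpy as np

# Illustrative PFILin-style setup (not the paper's algorithm):
# each arm a has a context x_a in R^d, rewards are vector-valued with
# mean Theta^T x_a, and Theta in R^{d x m} is estimated by ridge regression.

rng = np.random.default_rng(0)
d, m, n_arms, lam = 5, 3, 10, 1.0           # dims, objectives, arms, ridge reg.
X = rng.normal(size=(n_arms, d))             # arm contexts (assumed known)
Theta_true = rng.normal(size=(d, m))         # unknown parameter (simulation only)

V = lam * np.eye(d)                          # regularized Gram matrix
S = np.zeros((d, m))                         # running sum of x_t y_t^T

for t in range(200):
    a = rng.integers(n_arms)                 # placeholder uniform sampling rule
    x = X[a]
    y = Theta_true.T @ x + 0.1 * rng.normal(size=m)   # noisy vector reward
    V += np.outer(x, x)
    S += np.outer(x, y)

Theta_hat = np.linalg.solve(V, S)            # ridge estimate of Theta
means = X @ Theta_hat                        # estimated mean reward vectors

def pareto_front(M):
    # An arm is (estimated) Pareto optimal if no other arm weakly dominates it
    # in every objective and strictly dominates it in at least one.
    keep = []
    for i in range(len(M)):
        dominated = any(
            np.all(M[j] >= M[i]) and np.any(M[j] > M[i])
            for j in range(len(M)) if j != i
        )
        if not dominated:
            keep.append(i)
    return keep

print("Estimated Pareto-optimal arms:", pareto_front(means))
```

In this sketch the observation at round $t$ only enters the estimator through the rank-one update along the chosen context $x_t$; the abstract's point is that its proposed estimator instead propagates each observation along multiple context directions, which the code above does not attempt to reproduce.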
Submission Number: 243