Abstract: In high-stakes environments where uncertainties abound, set-valued prediction offers a cautious and robust mechanism by predicting a set of candidate labels for each test instance, mitigating the risk associated with prediction errors. Yet, integrating this paradigm with out-of-distribution (OOD) detection remains scarcely explored in settings such as online learning with bandit feedback. The bandit feedback mechanism informs the learner only about the correctness of the pulled arm/action rather than the explicit ground-truth label, leaving the true class label unknown when an incorrect action is taken. To address this challenge, we introduce BanditGPS, which conducts set-valued prediction with OOD detection in the bandit feedback setting, using an estimate of the ground-truth class labels. BanditGPS achieves three objectives: render small/informative prediction sets, enhance OOD detection performance, and control the recall for all normal classes to meet prescribed requirements. Our approach is characterized by its loss function, which trades off OOD detection performance against prediction-set size. Theoretically, we prove that the convergence rate of the regret is $\tilde{\mathcal{O}}(T^{-1/2})$. Empirical results further show that BanditGPS effectively controls the recalls while achieving promising performance on OOD detection and informative prediction.
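The bandit feedback mechanism described in the abstract can be illustrated with a minimal sketch. This toy loop (not the authors' implementation; `bandit_feedback_round`, the number of classes, and the random action policy are all illustrative assumptions) shows the key information constraint: the learner observes only a 0/1 signal for the pulled action, so when the feedback is 0 the true class label remains unknown.

```python
import random

def bandit_feedback_round(true_label, pulled_action):
    """Bandit feedback: reveal only whether the pulled action was correct,
    never the ground-truth label itself."""
    return int(pulled_action == true_label)

random.seed(0)
num_classes = 5
history = []
for t in range(10):
    # The environment knows the true label; the learner does not.
    true_label = random.randrange(num_classes)
    # Toy policy: pull an action uniformly at random.
    pulled_action = random.randrange(num_classes)
    feedback = bandit_feedback_round(true_label, pulled_action)
    # When feedback == 0, the learner only learns that pulled_action is wrong;
    # the true class could be any of the remaining num_classes - 1 labels.
    history.append((pulled_action, feedback))

print(history)
```

This is the setting in which BanditGPS must estimate the ground-truth labels: incorrect pulls rule out a single class per round rather than revealing the answer.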
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: Revised the paper in response to the review feedback.
Assigned Action Editor: ~Huazheng_Wang1
Submission Number: 3884