Keywords: Offline contextual dueling bandits, clustering of bandits, active data augmentation
TL;DR: We address poor coverage in offline preference learning by clustering similar users to aggregate data and by augmenting offline data with actively selected data, achieving strong theoretical guarantees and empirical gains.
Abstract: Offline preference learning from pairwise feedback is an important problem in applications such as AI alignment and recommendations.
Because offline data are static, most prior methods in this area suffer from poor coverage of the feature (i.e., context-action) distribution induced by the optimal policy, which selects the actions a user most prefers. To address the limited sample size and poor coverage inherent in offline preference learning, this work considers two complementary solutions. First, we exploit data from multiple users within a purely offline setting by learning user similarities. We design Off-C$^2$PL, which aggregates offline data from users with similar preferences to enlarge the effective sample size. Our theoretical results show that this approach improves coverage and reduces policy suboptimality. Second, we consider a hybrid setting in which we can actively collect a small number of samples to augment the offline data. In this setting, we propose ADA-Off-C$^2$PL, which targets the least-covered directions of the offline data to alleviate poor coverage. Theoretical results demonstrate that this approach is particularly effective under highly imbalanced offline data, where the offline data provide good coverage for most feature dimensions but poor coverage for a few. Empirical results on synthetic and real-world datasets show that our methods outperform baselines by at least $57.5$\%.
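As a rough illustration of what "targeting the least-covered directions" could mean, the sketch below assumes coverage is measured by a regularized Gram matrix of offline feature vectors, so that the least-covered direction is the eigenvector with the smallest eigenvalue. This is an assumption made for illustration and not necessarily the selection rule used by ADA-Off-C$^2$PL; the function names `least_covered_direction` and `pick_query` and the parameter `reg` are hypothetical.

```python
import numpy as np

def least_covered_direction(features, reg=1e-6):
    """Return the direction with the least offline coverage.

    features: (n, d) array of offline feature vectors (hypothetical input).
    Coverage is measured here by the regularized Gram matrix, an assumption
    of this sketch rather than the paper's stated criterion.
    """
    d = features.shape[1]
    gram = features.T @ features + reg * np.eye(d)
    # eigh returns eigenvalues in ascending order for symmetric matrices,
    # so column 0 is the eigenvector of the smallest eigenvalue.
    eigvals, eigvecs = np.linalg.eigh(gram)
    return eigvecs[:, 0], eigvals[0]

def pick_query(candidates, direction):
    """Pick the candidate feature vector most aligned with `direction`."""
    scores = np.abs(candidates @ direction)
    return int(np.argmax(scores))

# Imbalanced offline data: good coverage in dims 0 and 1, poor in dim 2.
rng = np.random.default_rng(0)
offline = rng.normal(size=(200, 3)) * np.array([1.0, 1.0, 0.01])
direction, min_eig = least_covered_direction(offline)
candidates = np.eye(3)  # hypothetical candidate queries
chosen = pick_query(candidates, direction)
```

In this toy setup the offline data barely vary along the third coordinate, so the selected query is the candidate aligned with that coordinate, mirroring the imbalanced-coverage regime described above.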
Submission Number: 34