Keywords: Contextual Combinatorial Bandits, Human Feedback, Model Misspecification
TL;DR: We propose algorithms that learn the optimal contextual set selection from human feedback under model misspecification.
Abstract: A common and efficient way to elicit human feedback is to present users with a set of options and record their relative preferences among them. The contextual combinatorial bandits problem captures this setting algorithmically; however, it implicitly assumes a consistent underlying reward model for the options.
In the human-feedback setting (where, e.g., different reviewers may rate different samples), no such model may exist -- the model is *misspecified*.
We first derive a lower bound for our setting, and then show that model misspecification can lead to catastrophic failure of the C$^2$UCB algorithm (which is otherwise near-optimal when there is no misspecification). We then propose two algorithms: the first (MC$^2$UCB) requires knowledge of the level of misspecification $\epsilon$ (i.e., the absolute deviation from the closest well-specified model); the second is a general framework that extends to unknown $\epsilon$. Our theoretical analysis shows that both algorithms achieve near-optimal regret. Empirical evaluations, conducted in both a synthetic environment and a real-world movie-recommendation application, demonstrate the adaptability of our algorithms to various degrees of misspecification. This highlights their ability to learn effectively from human feedback, even under model misspecification.
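To make the role of the known misspecification level $\epsilon$ concrete, here is a minimal, hypothetical sketch of an $\epsilon$-aware optimistic set selection in the linear-UCB style. The exact bonus form, function names, and parameters are illustrative assumptions for exposition, not the paper's MC$^2$UCB algorithm.

    # Hypothetical sketch: a LinUCB-style score whose exploration bonus is
    # widened by a known misspecification level eps, followed by top-k
    # selection. Names and the exact bonus form are illustrative only.
    import numpy as np

    def select_set(X, V, theta_hat, alpha, eps, k):
        """Pick k items by optimistic score under eps-misspecification.

        X         : (n, d) item feature vectors for the current context
        V         : (d, d) regularized design matrix for confidence widths
        theta_hat : (d,) ridge estimate of the reward parameter
        alpha     : confidence-width multiplier
        eps       : assumed bound on per-item deviation from the best linear model
        k         : number of items to present to the user
        """
        V_inv = np.linalg.inv(V)
        means = X @ theta_hat                                      # estimated rewards
        widths = np.sqrt(np.einsum("ij,jk,ik->i", X, V_inv, X))    # ||x||_{V^{-1}}
        scores = means + alpha * widths + eps                      # eps inflates optimism
        return np.argsort(scores)[-k:][::-1]                       # indices of top-k items

    # toy usage
    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 5))
    V = np.eye(5) + X.T @ X
    theta_hat = rng.normal(size=5)
    print(select_set(X, V, theta_hat, alpha=1.0, eps=0.1, k=3))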
Submission Number: 14