Keywords: Dueling Bandits; Evaluation Bias
Abstract: In dueling bandits, an agent explores and exploits choices (i.e., arms) by learning from stochastic feedback in the form of relative preferences. Prior studies have focused on unbiased feedback. In practice, however, the feedback provided by evaluators can be biased; for example, human users are likely to give biased evaluations of large language models due to their heterogeneous backgrounds. In this work, we aim to minimize the regret in dueling bandits under evaluators' biased feedback. We begin with a benchmark case where the evaluators' bias information is known. Even this case is nontrivial, because the bias cannot be easily decoupled from the feedback. We overcome this challenge by proposing an unbiased arm-performance estimator and a bias-sensitive dueling bandits algorithm. Despite the estimator's complex form, we analyze the regret and show that feedback which either matches or opposes the ground truth reduces the regret. We then study the case where the evaluators' bias information is unknown. In this case, the estimator rarely admits a closed-form solution, because the problem of solving for it is non-convex. We address this challenge with an extended bias-sensitive algorithm that incorporates block coordinate descent, and we prove that it achieves the same order of regret as in the known-bias case, up to a bounded error. Experiments show that, compared with baselines, our algorithms reduce the regret by up to 86.9%.
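To make the setting concrete, here is a minimal sketch (not the paper's algorithm, whose details are not given in the abstract) of pairwise feedback in a dueling bandit with a biased evaluator. The Bradley-Terry-style preference model, the additive `bias` term, and all parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def duel(theta, i, j, bias=0.0):
    """Return 1 if arm i beats arm j under an assumed Bradley-Terry-style
    model, with the evaluator's bias shifting the preference probability."""
    p = 1.0 / (1.0 + np.exp(-(theta[i] - theta[j])))  # ground-truth preference
    p_biased = np.clip(p + bias, 0.0, 1.0)            # biased evaluator feedback
    return int(rng.random() < p_biased)

theta = np.array([0.1, 0.5, 0.9])   # latent arm utilities, unknown to the learner
wins = np.zeros((3, 3))
plays = np.zeros((3, 3))

# Naive uniform exploration over arm pairs with a biased evaluator.
for _ in range(5000):
    i, j = rng.choice(3, size=2, replace=False)
    wins[i, j] += duel(theta, i, j, bias=0.15)
    plays[i, j] += 1

# Empirical preference matrix; without a bias-correcting estimator, these
# estimates are shifted away from the true preference probabilities.
print(np.round(wins / np.maximum(plays, 1), 2))
```

Running this shows the empirical win rates drifting above the ground-truth preferences, which is the effect the paper's bias-sensitive estimator is designed to undo.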
Primary Area: Theory (e.g., control theory, learning theory, algorithmic game theory)
Submission Number: 15574