TL;DR: We introduce the first clustering of dueling bandit algorithms, covering both linear and neural settings, which provably benefit from cross-user collaboration in the presence of preference feedback.
Abstract: The contextual multi-armed bandit (MAB) is a widely used framework for problems requiring sequential decision-making under uncertainty, such as recommendation systems. In applications involving a large number of users, the performance of contextual MAB can be significantly improved by facilitating collaboration among multiple users. This has been achieved by the clustering of bandits (CB) methods, which adaptively group the users into different clusters and achieve collaboration by allowing the users in the same cluster to share data. However, classical CB algorithms typically rely on numerical reward feedback, which may not be practical in certain real-world applications. For instance, in recommendation systems, it is more realistic and reliable to solicit preference feedback between pairs of recommended items rather than absolute rewards. To address this limitation, we introduce the first "clustering of dueling bandit algorithms" to enable collaborative decision-making based on preference feedback. We propose two novel algorithms: (1) Clustering of Linear Dueling Bandits (COLDB) which models the user reward functions as linear functions of the context vectors, and (2) Clustering of Neural Dueling Bandits (CONDB) which uses a neural network to model complex, non-linear user reward functions. Both algorithms are supported by rigorous theoretical analyses, demonstrating that user collaboration leads to improved regret bounds. Extensive empirical evaluations on synthetic and real-world datasets further validate the effectiveness of our methods, establishing their potential in real-world applications involving multiple users with preference-based feedback.
Lay Summary: Imagine you’re using a recommendation system, like one that suggests movies or products. Instead of giving a rating (like 5 stars), you might prefer to compare two options and say which one you like better. This kind of feedback is more natural and easier for people to provide. But how can a recommendation system learn from these comparisons and improve its suggestions over time, especially when there are many users with similar tastes?
Our work tackles this problem by introducing a new way for recommendation systems to collaborate and learn from users’ preferences. Traditionally, systems group users together to share data and improve recommendations, but they rely on numerical ratings, which aren’t always practical. We propose two new methods that work with preference feedback (like “I prefer A over B”):
- COLDB: This method assumes that user preferences can be modeled as a linear function based on the features of the items (like genre or price).
- CONDB: This method uses a neural network to handle complex, non-linear preferences.
Both methods come with strong theoretical guarantees, showing that collaboration among users leads to better recommendations. We tested them on simulated and real-world datasets, and they outperformed existing techniques. This work opens the door to more practical and effective recommendation systems that rely on natural human feedback.
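To make the idea of learning from pairwise preference feedback concrete, here is a minimal toy sketch (not the authors' COLDB/CONDB algorithms): it assumes a hidden linear utility, simulates "I prefer A over B" comparisons under a Bradley-Terry model, and recovers the utility weights by logistic regression on feature differences. All names and parameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5                              # item feature dimension (illustrative)
theta_true = rng.normal(size=d)    # hidden linear utility weights


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


# Simulate pairwise preference feedback: the user prefers item a over item b
# with probability sigmoid(theta . (x_a - x_b)) (Bradley-Terry model).
X_a = rng.normal(size=(2000, d))
X_b = rng.normal(size=(2000, d))
diff = X_a - X_b
prefs = (rng.random(2000) < sigmoid(diff @ theta_true)).astype(float)

# Fit theta by gradient ascent on the logistic log-likelihood of the
# observed comparisons.
theta = np.zeros(d)
lr = 0.5
for _ in range(200):
    grad = diff.T @ (prefs - sigmoid(diff @ theta)) / len(prefs)
    theta += lr * grad

# The recovered weights should point in nearly the same direction as
# the true ones (cosine similarity close to 1).
cos = theta @ theta_true / (np.linalg.norm(theta) * np.linalg.norm(theta_true))
print(round(cos, 2))
```

The same comparison data would support a neural utility model instead of a linear one, which is the distinction between the linear and neural variants described above.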
Primary Area: General Machine Learning->Online Learning, Active Learning and Bandits
Keywords: Multi-armed bandits, dueling bandits, clustering of bandits
Submission Number: 14633