Combinatorial Reinforcement Learning with Preference Feedback

27 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: combinatorial reinforcement learning, preference feedback, contextual MNL bandits, nonlinear function approximation
TL;DR: We consider combinatorial reinforcement learning with preference feedback, where a set of multiple items is offered and preference feedback is received, while accounting for state transitions in decision-making.
Abstract: In this paper, we consider combinatorial reinforcement learning with preference feedback, where a learning agent sequentially offers an action—an assortment of multiple items—to a user, whose preference feedback follows a multinomial logit (MNL) model. This framework allows us to model real-world scenarios, particularly those involving long-term user engagement, such as in recommender systems and online advertising. However, this framework faces two main challenges: (1) the unknown value of each item, unlike in traditional MNL bandits, which account only for single-step preference feedback, and (2) the difficulty of ensuring optimism with tractable assortment selection in the combinatorial action space. In this paper, we assume a contextual MNL preference model, where mean utilities are linear, and the value of each item is approximated using general function approximation. We propose an algorithm, MNL-V$Q$L, that addresses these challenges, making it both computationally and statistically efficient. As a special case, for linear MDPs (with the MNL preference model), we establish a regret lower bound and show that MNL-V$Q$L achieves near-optimal regret. To the best of our knowledge, this is the first work to provide statistical guarantees in combinatorial RL with preference feedback.
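The abstract's contextual MNL preference model can be illustrated with a minimal sketch: given an offered assortment, the user chooses each item with probability proportional to the exponential of its mean utility, with an outside option (choosing nothing) of utility zero. The feature vectors and parameter `theta` below are hypothetical placeholders, not values from the paper; mean utilities are linear in the features, as the abstract assumes.

```python
import math

def mnl_choice_probs(utilities):
    """Choice probabilities under a multinomial logit (MNL) model.

    Includes an outside option (no choice) with utility 0, so the
    denominator contains an extra exp(0) = 1 term.
    """
    expu = [math.exp(u) for u in utilities]
    denom = 1.0 + sum(expu)  # the leading 1 is the outside option
    probs = [e / denom for e in expu]
    return probs, 1.0 / denom  # per-item probabilities, outside-option prob

# Linear mean utilities u_i = <x_i, theta> (hypothetical features/parameter)
theta = [0.5, -0.2]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # one row per offered item
utils = [sum(x * t for x, t in zip(xi, theta)) for xi in features]
probs, p_out = mnl_choice_probs(utils)
```

All probabilities (items plus outside option) sum to one, which is what makes single-step preference feedback over an assortment a categorical observation.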
Primary Area: reinforcement learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 8472