Reinforcement Learning with Human Feedback: Learning Dynamic Choices via Pessimism

Published: 20 Jun 2023, Last Modified: 20 Jun 2023 · ILHF Workshop ICML 2023
Keywords: Reinforcement Learning with Human Feedback; Offline Reinforcement Learning; Statistics
Abstract: In this paper, we study offline Reinforcement Learning with Human Feedback (RLHF), where we aim to learn the human's underlying reward and the MDP's optimal policy from a set of trajectories induced by human choices. We focus on the Dynamic Discrete Choice (DDC) model for modeling and understanding human choices, which is widely used to model a human decision-making process with forward-looking behavior and bounded rationality. We propose a \underline{D}ynamic-\underline{C}hoice-\underline{P}essimistic-\underline{P}olicy-\underline{O}ptimization (DCPPO) method and prove that the suboptimality of DCPPO \textit{almost} matches that of the classical pessimistic offline RL algorithm in its dependency on distribution shift and dimension. To the best of our knowledge, this paper provides the first theoretical guarantees for off-policy offline RLHF with the dynamic discrete choice model.
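For reference, a minimal sketch of the standard dynamic discrete choice rule that this setting builds on; the notation ($Q_h$ for the human's action-value, $\varepsilon_h(a)$ for taste shocks) is assumed here for illustration and is not taken from this page. At each step, the human observes i.i.d. Gumbel-distributed shocks and picks the action maximizing perceived value, which yields logit choice probabilities:
\[
a_h \in \arg\max_{a \in \mathcal{A}} \bigl\{ Q_h(s_h, a) + \varepsilon_h(a) \bigr\},
\qquad
\mathbb{P}(a_h = a \mid s_h) = \frac{\exp\bigl(Q_h(s_h, a)\bigr)}{\sum_{a' \in \mathcal{A}} \exp\bigl(Q_h(s_h, a')\bigr)},
\]
where $Q_h$ aggregates the immediate reward and the discounted value of future choices (the forward-looking component), and the Gumbel noise captures bounded rationality.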
Submission Number: 5