Keywords: RLHF, General Preference, Regularized RLHF, Strongly Convex, Greedy Sampling, Explore-Then-Commit, Fast Rates, Nash Equilibrium
TL;DR: Under a feature diversity assumption, we establish fast regret bounds for contextual online RLHF under the Generalized Bilinear Preference Model for any strongly convex regularizer.
Abstract: We consider the problem of *regularized* best-response max-regret minimization in online RLHF under general preferences and bandit feedback. While various regularizers are utilized to robustify alignment, known polylogarithmic regret guarantees remain heavily specific to KL. To investigate whether such fast rates extend beyond KL, we adopt the *Generalized Bilinear Preference Model (GBPM)*, which captures intransitive preferences over $d$-dimensional item-wise features via a rank-$2r$ skew-symmetric matrix, to isolate the impact of generic regularization. Crucially, under GBPM, we prove that the dual gap of any greedy policy is bounded by the *squared* estimation error, derived using *only* strong convexity and skew-symmetry. Under a feature coverage assumption, we establish a *generic* polylogarithmic regret of $\tilde{\mathcal{O}}(\eta d^4 C_{\min}^{-1} (\log T)^2 \wedge d^2 C_{\min}^{-1/2} \sqrt{T})$ with Greedy Sampling, and a dimension-wise improved regret (for well-conditioned arm-sets) of $\tilde{\mathcal{O}}(C_{\min}^{-2} \sqrt{\eta r T} \wedge r^{1/3} C_{\min}^{-4/3} T^{2/3})$ with Explore-Then-Commit, where $\eta^{-1}$ is the regularization coefficient, $T$ is the time horizon, and $C_{\min}$ is an arm-set dependent quantity. This demonstrates that "fast" regrets are *not* KL-specific, but rather a fundamental consequence of generic strongly convex geometry.
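Illustration (not part of the submission): the sketch below shows how a skew-symmetric bilinear score can encode intransitive preferences, as the abstract describes for GBPM. The abstract does not state the link function, so the sigmoid link, the one-hot features, and the helper names `skew_symmetric` and `preference` are all assumptions made here for illustration only.

```python
import numpy as np

def skew_symmetric(d, r, rng):
    """Hypothetical construction of a d x d skew-symmetric matrix of rank at most 2r."""
    U = rng.standard_normal((d, r))
    V = rng.standard_normal((d, r))
    return U @ V.T - V @ U.T  # (U V^T - V U^T)^T = -(U V^T - V U^T)

def preference(theta, phi_a, phi_b):
    """Assumed link: P(a preferred over b) = sigmoid(phi_a^T Theta phi_b)."""
    score = phi_a @ theta @ phi_b
    return 1.0 / (1.0 + np.exp(-score))

# Rock-paper-scissors with one-hot features: a cyclic, intransitive preference.
# Entry [i, j] > 0 means item i tends to beat item j.
theta_rps = np.array([
    [0.0, -1.0, 1.0],   # rock: loses to paper, beats scissors
    [1.0, 0.0, -1.0],   # paper: beats rock, loses to scissors
    [-1.0, 1.0, 0.0],   # scissors: loses to rock, beats paper
])
rock, paper, scissors = np.eye(3)

p_rock_scissors = preference(theta_rps, rock, scissors)
p_scissors_paper = preference(theta_rps, scissors, paper)
p_paper_rock = preference(theta_rps, paper, rock)
```

Because the score flips sign when the two items are swapped, the two preference probabilities for any pair sum to one, and each item in the cycle beats the next with probability above one half, which no single scalar reward per item can reproduce.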
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 127