Sample Complexity of RLHF Reward Learning under General Reward Classes

20 Apr 2026 (modified: 27 Apr 2026) · Withdrawn by Authors · CC BY 4.0

Abstract: We study the minimax sample complexity of Bradley–Terry preference-based reward learning under an arbitrary reward class $\mathcal{R}$. For the realisable logistic (Bradley–Terry) preference model with a single-policy concentrability coefficient $C$, we prove matching upper and lower bounds $$N^\star(\mathcal{R}, \varepsilon) = \Theta\!\left(\frac{C^2\cdot \mathcal{H}(\mathcal{R}, \varepsilon/C)}{\kappa(B)\cdot\varepsilon^2}\right),$$ where $\mathcal{H}(\mathcal{R}, \varepsilon) := \log N(\mathcal{R}, \varepsilon, L^2(\tilde\mu))$ is the $L^2$ metric entropy of $\mathcal{R}$ under the induced action marginal $\tilde\mu$ and $\kappa(B) := \sigma(2B)(1-\sigma(2B))$ is the Bradley–Terry Fisher-information curvature on the pairwise-difference range $[-2B,2B]$. The bound matches in $C$, $\varepsilon$, and $B$ up to absolute constants, under a mild saturation condition on $\mathcal{R}$. The closure of the $\kappa(B)$-gap uses a boundary Bregman upper bound on the softplus, invoked in the Fano step with an adversarial ground truth whose induced pairwise differences saturate the boundary. Our result unifies and sharpens a line of work that had resolved the rate only for structured subclasses: linear, low-rank, and general preferences. Three standard reward classes instantiate the bound: linear in $\mathbb{R}^d$, rank-$k$ in a $d$-dimensional embedding, and Sobolev $W^{s,2}([0,1]^d)$. The upper bound uses a localised Rademacher argument on the conditional MLE driven by a quadratic curvature identity for the Bradley–Terry log-likelihood; the two Bregman inequalities driving the rate are mechanically verified in Lean 4 / Mathlib with zero sorry and no custom axioms. The lower bound is a Fano–Le Cam construction tailored to the Bradley–Terry Fisher information, made coverage-aware by a restricted packing on the support of the optimal policy and made $B$-matching by saturating the pairwise range.
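To make the shape of the stated rate concrete, the following is a minimal numerical sketch of $\kappa(B) = \sigma(2B)(1-\sigma(2B))$ and of the expression $C^2\,\mathcal{H}(\mathcal{R}, \varepsilon/C) / (\kappa(B)\,\varepsilon^2)$ with absolute constants dropped. The function names (`kappa`, `rate`, `linear_entropy`) are hypothetical, and the entropy scaling $d\log(1/\varepsilon)$ used for the linear class is an assumed textbook form, not a formula taken from the paper.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def kappa(B: float) -> float:
    """Bradley-Terry curvature sigma(2B) * (1 - sigma(2B)).

    Uses the identity 1 - sigma(x) = sigma(-x) so that large B
    does not round the second factor to exactly zero.
    """
    return sigmoid(2.0 * B) * sigmoid(-2.0 * B)


def rate(C: float, entropy, B: float, eps: float) -> float:
    """C^2 * H(R, eps / C) / (kappa(B) * eps^2), absolute constants dropped.

    `entropy` is any callable mapping a scale to a metric-entropy value.
    """
    return C ** 2 * entropy(eps / C) / (kappa(B) * eps ** 2)


def linear_entropy(d: int):
    """ASSUMPTION: d * log(1 / scale) entropy for a d-dimensional linear class."""
    return lambda scale: d * math.log(1.0 / scale)


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the paper.
    for B in (0.5, 1.0, 2.0):
        n = rate(C=2.0, entropy=linear_entropy(10), B=B, eps=0.1)
        print(f"B={B}: kappa={kappa(B):.4g}, rate~{n:.4g}")
```

Since $\kappa(B)$ decays roughly like $e^{-2B}$ for large $B$, the sketch shows the sample requirement growing exponentially in the reward range, which is the $B$-dependence that the abstract says is matched by the lower bound.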

Submission Type: Regular submission (no more than 12 pages of main content)

Assigned Action Editor: ~Hanrui_Zhang1

Submission Number: 8514