Provable Policy Optimization for Reinforcement Learning from Trajectory Preferences with an Unknown Link Function

ICLR 2026 Conference Submission 15628 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: reinforcement learning theory, preference-based reinforcement learning, unknown link function, zeroth-order optimization, policy optimization, stochastic MDPs, provable convergence
TL;DR: ZSPO is a zeroth-order sign-based policy optimization algorithm with provable guarantees for reinforcement learning from trajectory preference feedback with an unknown link function.
Abstract: The link function, which characterizes the relationship between the preference for two trajectories and their cumulative rewards, is a crucial component in designing RL algorithms that learn from preference feedback. Most existing methods, both theoretical and empirical, assume that the link function is known (often a logistic function based on the Bradley-Terry model), which is arguably restrictive given the complex nature of preferences, especially those of humans. To avoid misspecification, this paper studies preference-based RL with an unknown link function and proposes a novel zeroth-order policy optimization algorithm called ZSPO. Unlike typical zeroth-order methods, which rely on the known link function to estimate value function differences and form an accurate gradient estimator, ZSPO estimates only the sign of the value function difference. It then constructs a parameter update direction that is positively correlated with the true policy gradient, eliminating the need to know the link function exactly. Under mild conditions, ZSPO provably converges to a stationary policy at a polynomial rate in the number of policy iterations and trajectories per iteration. Empirical evaluations further demonstrate the robustness of ZSPO under link function misspecification.
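To make the idea concrete, the sketch below shows one plausible zeroth-order sign-based update of the kind the abstract describes: perturb the policy parameters in a random direction, compare trajectories from the two perturbed policies through preference queries (whose link function is never used), take a majority vote as a sign estimate of the value difference, and step along the perturbation direction scaled by that sign. This is a minimal illustration, not the paper's exact ZSPO algorithm; the helpers `rollout_return` and `query_preference` and all hyperparameter names are hypothetical assumptions for this sketch.

```python
import numpy as np


def sign_based_zeroth_order_update(theta, step_size, smoothing_radius,
                                   num_pairs, rollout_return,
                                   query_preference, rng):
    """One hypothetical zeroth-order sign-based policy update.

    rollout_return(theta) -> sampled cumulative reward of one trajectory
        generated by the policy with parameters theta (assumed helper).
    query_preference(r1, r2) -> 1 if the first trajectory is preferred,
        else 0; the link function mapping the reward gap to the preference
        probability is unknown to the algorithm (assumed helper).
    """
    # Draw a random unit perturbation direction (two-point scheme).
    u = rng.standard_normal(theta.shape)
    u /= np.linalg.norm(u)

    theta_plus = theta + smoothing_radius * u
    theta_minus = theta - smoothing_radius * u

    # Compare trajectories from the two perturbed policies via preference
    # feedback; the majority vote estimates sign(V(theta_plus) - V(theta_minus)).
    votes = 0
    for _ in range(num_pairs):
        r_plus = rollout_return(theta_plus)
        r_minus = rollout_return(theta_minus)
        votes += 1 if query_preference(r_plus, r_minus) == 1 else -1
    sign_estimate = float(np.sign(votes))

    # Update along the perturbation direction, scaled only by the sign,
    # so the step is positively correlated with the true policy gradient
    # without ever evaluating or inverting the link function.
    return theta + step_size * sign_estimate * u
```

In this sketch, the estimated sign replaces the magnitude of the value difference that standard zeroth-order gradient estimators would need, which is exactly why knowledge of the link function can be dispensed with.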
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 15628