Keywords: Kernel method, Kernel gradient descent, PPO, Temporal difference
TL;DR: By framing PPO in an RKHS (kernel) setting, we provide a new analytical perspective that both deepens understanding and delivers global convergence guarantees.
Abstract: We revisit Proximal Policy Optimization (PPO) from a function-space perspective.
Our analysis decouples policy evaluation and improvement in a reproducing kernel Hilbert space (RKHS):
(i) a kernelized temporal-difference (TD) critic performs efficient RKHS-gradient updates using only one-step state–action transition samples; and
(ii) a KL-regularized, natural-gradient policy step exponentiates the evaluated action-value function, recovering a PPO/TRPO-style proximal update in continuous state–action spaces.
We provide non-asymptotic, instance-adaptive guarantees whose rates depend on RKHS entropy, unifying tabular, linear, Sobolev, Gaussian, and Neural Tangent Kernel (NTK) regimes. We also derive a sampling rule for the proximal update that ensures the optimal $k^{-1/2}$ convergence rate for stochastic optimization.
Empirically, the theory-aligned schedule improves stability and sample efficiency on common control tasks (e.g., CartPole, Acrobot, and HalfCheetah), while our TD-based critic attains favorable throughput versus a GAE baseline.
Altogether, our results place PPO on a firmer theoretical footing beyond finite-dimensional assumptions and clarify when RKHS-proximal updates with kernel-TD critics yield global policy improvement with practical efficiency.
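The two decoupled steps described in (i) and (ii) can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes a Gaussian kernel, a critic stored as a growing kernel expansion, and a finite action set (the paper treats continuous state–action spaces). The names `gaussian_kernel`, `KernelTDCritic`, and `proximal_policy_step`, and the parameters `bandwidth`, `step_size`, and `eta`, are hypothetical and chosen only for illustration.

```python
import numpy as np


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two state-action feature vectors."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * bandwidth ** 2)))


class KernelTDCritic:
    """Critic Q(z) = sum_i c_i * k(z_i, z), where z encodes a (state, action) pair.

    Assumption: a plain functional TD(0) step, Q <- Q + step_size * delta * k(z, .),
    which appends one kernel center per observed transition.
    """

    def __init__(self, kernel, gamma=0.99, step_size=0.1):
        self.kernel = kernel
        self.gamma = gamma
        self.step_size = step_size
        self.centers = []  # observed state-action points z_i
        self.coeffs = []   # expansion coefficients c_i

    def value(self, z):
        """Evaluate the current kernel expansion at z (0.0 when empty)."""
        return sum(c * self.kernel(zc, z)
                   for zc, c in zip(self.centers, self.coeffs))

    def update(self, z, reward, z_next):
        """One-step TD update from a single transition (z, reward, z_next)."""
        td_error = reward + self.gamma * self.value(z_next) - self.value(z)
        self.centers.append(np.asarray(z, dtype=float))
        self.coeffs.append(self.step_size * td_error)
        return td_error


def proximal_policy_step(old_probs, q_values, eta):
    """KL-regularized step at one state: new(a) proportional to old(a) * exp(eta * Q(a)).

    Assumes a finite action set and strictly positive old_probs.
    """
    logits = np.log(np.asarray(old_probs, dtype=float))
    logits = logits + eta * np.asarray(q_values, dtype=float)
    logits -= logits.max()  # shift for numerical stability before exponentiating
    new_probs = np.exp(logits)
    return new_probs / new_probs.sum()
```

In this sketch the critic consumes only one-step transitions, and the policy step shifts probability mass toward actions with higher estimated value, with `eta` playing the role of the inverse KL penalty. Setting `eta = 0` leaves the policy unchanged.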
Primary Area: learning theory
Submission Number: 24111