Policy Gradient with Kernel Quadrature

Published: 21 Feb 2024, Last Modified: 21 Feb 2024
Accepted by TMLR
Abstract: Reward evaluation of episodes becomes a bottleneck in a broad range of reinforcement learning tasks. Our aim in this paper is to select a small but representative subset of a large batch of episodes, and to compute rewards only on that subset, for more efficient policy gradient iterations. We build a Gaussian process model of discounted returns or rewards to derive a positive definite kernel on the space of episodes, run an "episodic" kernel quadrature method to compress the information of the sample episodes, and pass the reduced episodes to the policy network for gradient updates. We present the theoretical background of this procedure as well as numerical illustrations in MuJoCo tasks.
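As a rough illustration of the selection step described in the abstract, the following is a minimal sketch (not the authors' implementation) of picking a representative subset of episodes with kernel herding, one of the kernel-quadrature variants the revision mentions testing in Appendix C. The RBF kernel on episode-level feature vectors, the batch size, and the subset size are illustrative assumptions.

```python
# Minimal sketch, assuming episodes are summarized by feature vectors and
# compared with an RBF kernel; the paper's own kernel is derived from a
# Gaussian process model of returns, which is not reproduced here.
import numpy as np


def rbf_gram(features, lengthscale=1.0):
    """Gram matrix of an RBF kernel on episode-level feature vectors."""
    sq = np.sum(features**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * features @ features.T
    return np.exp(-0.5 * d2 / lengthscale**2)


def kernel_herding(K, m):
    """Greedily pick m episode indices whose empirical kernel mean embedding
    approximates that of the full batch (uniform weights assumed)."""
    n = K.shape[0]
    mean_embedding = K.mean(axis=1)      # kernel mean of the full batch
    selected = []
    running_sum = np.zeros(n)            # sum of kernel columns of chosen episodes
    for t in range(m):
        # Herding criterion: trade off coverage of the batch mean against
        # redundancy with the episodes already selected.
        scores = mean_embedding - running_sum / (t + 1)
        scores[selected] = -np.inf       # forbid repeats
        j = int(np.argmax(scores))
        selected.append(j)
        running_sum += K[:, j]
    return selected


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_episodes, feat_dim, m = 256, 16, 32
    episode_feats = rng.normal(size=(n_episodes, feat_dim))  # stand-in for episode summaries
    K = rbf_gram(episode_feats)
    subset = kernel_herding(K, m)
    # Only these m episodes would have their rewards evaluated and be passed
    # to the policy-gradient update in place of the full batch of n episodes.
    print("selected episode indices:", subset)
```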
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: We followed the two suggestions by the Action Editor:
- We deleted the name of "Assumption A" and presented it as a heuristic described within the text.
- We ran some variants of PGKQ algorithms with kernel thinning/herding (instead of the one from Hayakawa et al. (2022)) against the task "Hopper-v4" and presented the results in Appendix C.
Supplementary Material: zip
Assigned Action Editor: ~Mirco_Mutti1
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Submission Number: 1901