Abstract: Reward evaluation of episodes becomes a bottleneck in a broad range of reinforcement learning tasks. Our aim in this paper is to select a small but representative subset of a large batch of episodes and to compute rewards only on that subset, enabling more efficient policy gradient iterations. We build a Gaussian process model of discounted returns or rewards to derive a positive definite kernel on the space of episodes, run an "episodic" kernel quadrature method to compress the information of the sampled episodes, and pass the reduced episodes to the policy network for gradient updates. We present the theoretical background of this procedure as well as numerical illustrations on MuJoCo tasks.
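The abstract describes a pipeline of kernel construction on episodes, subset selection by kernel quadrature, and a policy gradient update on the reduced batch. The following is a minimal illustrative sketch of that pipeline, using greedy kernel herding (one of the variants mentioned in the Appendix C experiments) as a stand-in for the quadrature rule of Hayakawa et al. (2022); the RBF kernel on per-episode features, uniform weights, and all function names here are assumptions for illustration, not the authors' implementation.

```python
# Sketch only: kernel on episodes + herding-style subset selection.
# episode_kernel, herding_select, and the feature/kernel choices are
# hypothetical placeholders, not the paper's actual PGKQ code.
import numpy as np


def episode_kernel(feats: np.ndarray, lengthscale: float = 1.0) -> np.ndarray:
    """RBF Gram matrix on per-episode feature vectors (a placeholder for the
    GP-derived positive definite kernel on the space of episodes)."""
    sq = np.sum(feats ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * feats @ feats.T
    return np.exp(-d2 / (2.0 * lengthscale ** 2))


def herding_select(K: np.ndarray, m: int) -> np.ndarray:
    """Greedy kernel herding: pick m episode indices whose empirical kernel
    mean approximates the mean embedding of the full batch."""
    mu = K.mean(axis=1)          # kernel mean embedding evaluated at each sample
    selected: list = []
    scores = mu.copy()
    for _ in range(m):
        i = int(np.argmax(scores))
        selected.append(i)
        # Penalize candidates similar to episodes already chosen.
        scores = mu - K[:, selected].mean(axis=1)
        scores[selected] = -np.inf
    return np.array(selected)


# Toy usage: 256 episodes summarized by 8-dim features; keep 16 of them for
# the expensive reward evaluation and the policy gradient update.
rng = np.random.default_rng(0)
episode_features = rng.normal(size=(256, 8))
K = episode_kernel(episode_features)
idx = herding_select(K, m=16)
weights = np.full(len(idx), 1.0 / len(idx))  # uniform here; a quadrature method
                                             # would also return non-uniform weights
print("selected episode indices:", idx)
```

Only the selected episodes (with their weights) would then have rewards computed and be passed to the policy network for the gradient step.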
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: We followed the two suggestions by the Action Editor:
- We removed the label "Assumption A" and presented it as a heuristic described in the text.
- We ran variants of the PGKQ algorithm with kernel thinning/herding (instead of the method of Hayakawa et al. (2022)) on the task "Hopper-v4" and presented the results in Appendix C.
Supplementary Material: zip
Assigned Action Editor: ~Mirco_Mutti1
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Submission Number: 1901