A comparison of GRPO and PPO in Reinforcement Learning Environments

Alexander Cremer

Published: 15 Oct 2025, Last Modified: 31 Oct 2025, Venue: BNAIC/BeNeLearn 2025 Poster, License: CC BY 4.0

Track: Type D (Master/Bachelor Thesis Abstracts)

Keywords: Reinforcement Learning, GRPO, PPO

Abstract: Recently, reinforcement learning (RL) has played a key role in fine-tuning Large Language Models (LLMs) through reinforcement learning from human feedback (RLHF). Proximal Policy Optimization (PPO), a well-established RL algorithm, has often been used in this domain. As an alternative, DeepSeek proposed Group Relative Policy Optimization (GRPO), which avoids the critic model required by PPO. Instead, GRPO bases the policy update on a group of sampled outputs, scoring each one relative to the others in its group. Although GRPO is promising in the LLM setting, there is little research on it in classic RL tasks, which can, for example, provide rewards at non-terminal steps. Therefore, this paper compares PPO and GRPO on various standard RL environments.
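
To illustrate the group mechanism mentioned in the abstract, here is a minimal sketch of a group-relative advantage estimate: each reward in a group is normalized by the group's mean and standard deviation, so no learned critic is needed. The function name `group_relative_advantages`, the `eps` constant, the use of the population standard deviation, and the example rewards are assumptions made for illustration; they are not taken from the thesis, and implementations may differ on these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other rewards in its group.

    Hypothetical helper: the group mean plays the role of the baseline
    that a critic would otherwise supply in PPO.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    # eps guards against division by zero when all rewards are equal
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up returns for four rollouts sampled from the same start state or prompt
group_rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(group_rewards)
```

In this sketch, rollouts that score above the group mean receive a positive advantage and those below receive a negative one. How to form such groups when rewards arrive at non-terminal steps is the kind of question the comparison in classic RL environments has to address.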

Serve As Reviewer: ~Koen_Ponse1

Submission Number: 61
