Track: Type D (Master/Bachelor Thesis Abstracts)
Keywords: Reinforcement Learning, GRPO, PPO
Abstract: Recently, reinforcement learning (RL) has played a key role in fine-tuning Large
Language Models (LLMs) through reinforcement learning from human feedback (RLHF).
Proximal Policy Optimization (PPO), a well-established RL algorithm, has often been used
in this domain. As an alternative, DeepSeek proposed Group Relative Policy
Optimization (GRPO), which avoids the critic model required in PPO. Instead,
GRPO uses a group mechanism for the policy update. Although promising in
the LLM setting, there is little research on GRPO in classic RL tasks, which can,
for example, provide non-terminal rewards. Therefore, this paper compares PPO and
GRPO on various standard RL environments.
Serve As Reviewer: ~Koen_Ponse1
Submission Number: 61