Symmetric Reinforcement Learning Loss for Robust Learning on Diverse Tasks and Model Scales

Published: 01 May 2025, Last Modified: 18 Jun 2025, ICML 2025 poster, CC BY 4.0
Abstract: Reinforcement learning (RL) training is inherently unstable due to factors such as moving targets and high gradient variance. Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning from AI Feedback (RLAIF) introduce additional challenges. For instance, diverse preferences complicate the alignment process, and prediction errors in a trained reward model can become more severe as the LLM generates unseen outputs. These RL challenges create confusion about whether the probability of an action for a given state should be increased or decreased, similar to the noise in labels for classification tasks. In this work, we focus on RL algorithms that share learning difficulties with cross-entropy loss, especially for low-probability predictions. To enhance stability, we adapt reverse cross-entropy (RCE) from supervised learning for noisy data, defining a symmetric RL loss. We demonstrate performance improvements across various tasks and scales. We conduct experiments in discrete action tasks (Atari games) and continuous action space tasks (MuJoCo benchmark and Box2D) using Symmetric A2C (SA2C) and Symmetric PPO (SPPO). Notably, SPPO shows strong performance across different hyperparameters. Furthermore, we validate the symmetric RL loss in the RLHF framework using PPO for natural language processing tasks such as IMDB positive sentiment and TL;DR summarization.
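To make the core idea concrete, the sketch below illustrates one plausible way the reverse cross-entropy recipe could be adapted into a policy-gradient objective. It is a minimal sketch, assuming the symmetric loss follows the symmetric cross-entropy construction of Wang et al. (2019), where the log of the implicit one-hot target is replaced by a finite constant; the function name `symmetric_pg_loss` and the hyperparameters `alpha`, `beta`, and `log_zero` are illustrative assumptions, not the paper's reported SA2C/SPPO formulation or settings.

```python
import torch

def symmetric_pg_loss(log_probs: torch.Tensor,
                      advantages: torch.Tensor,
                      alpha: float = 1.0,
                      beta: float = 0.1,
                      log_zero: float = -4.0) -> torch.Tensor:
    """Hypothetical symmetric policy-gradient loss (sketch only).

    Combines the usual advantage-weighted cross-entropy term with a
    reverse cross-entropy term in the spirit of Wang et al. (2019),
    where log(0) for the one-hot "target" action is clipped to the
    constant `log_zero` (e.g., -4).
    """
    probs = log_probs.exp()
    # Forward (standard) policy-gradient term: -A * log pi(a|s).
    forward = -(advantages * log_probs)
    # Reverse term: advantage-weighted reverse cross-entropy, which
    # collapses to -A * log_zero * (1 - pi(a|s)) once log(0) is clipped.
    reverse = -(advantages * log_zero * (1.0 - probs))
    return (alpha * forward + beta * reverse).mean()
```

Because `log_zero` is negative, the reverse term still pushes the sampled action's probability up when the advantage is positive and down when it is negative, but its gradient stays bounded even for low-probability actions, which is the stabilizing effect the abstract describes.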
Lay Summary: Reinforcement learning (RL), where AI learns by trial and error, is often less stable than supervised learning. This instability becomes even greater when learning from human or AI feedback, such as in RLHF or RLAIF, because varied preferences or misleading reward signals introduce noise, making it difficult for the AI to identify which actions are truly beneficial. To address this, we adapted reverse cross-entropy, a method from supervised learning known for handling noisy data, to create a symmetric RL loss. This approach makes the RL learning process more stable and reliable. Our method demonstrated strong and consistent performance improvements across diverse tasks, including video games, robotic control, and natural language processing with RLHF. This research contributes to building more robust AI systems that can learn effectively from complex, imperfect feedback.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Primary Area: Reinforcement Learning
Keywords: Reinforcement learning, Cross entropy, Symmetric loss functions
Submission Number: 13418