Enhancing Rating-Based Reinforcement Learning to Effectively Leverage Feedback from Large Vision-Language Models

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: A rating-based reinforcement learning framework that learns rewards from large vision-language model feedback with enhancements to reduce instability from data imbalance and noisy ratings.
Abstract: Designing effective reward functions remains a fundamental challenge in reinforcement learning (RL), as it often requires extensive human effort and domain expertise. While RL from human feedback has been successful in aligning agents with human intent, acquiring high-quality feedback is costly and labor-intensive, limiting its scalability. Recent advances in foundation models present a promising alternative: leveraging AI-generated feedback to reduce reliance on human supervision in reward learning. Building on this paradigm, we introduce ERL-VLM, an enhanced rating-based RL method that effectively learns reward functions from AI feedback. Unlike prior methods that rely on pairwise comparisons, ERL-VLM queries large vision-language models (VLMs) for absolute ratings of individual trajectories, enabling more expressive feedback and improved sample efficiency. Additionally, we propose key enhancements to rating-based RL that address instability caused by data imbalance and noisy labels. Through extensive experiments on both low-level and high-level control tasks, we demonstrate that ERL-VLM significantly outperforms existing VLM-based reward-generation methods. These results highlight the potential of AI feedback for scaling RL with minimal human intervention, paving the way for more autonomous and efficient reward learning.
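The abstract describes learning a reward model from absolute VLM ratings of trajectories rather than from pairwise preferences. The snippet below is a minimal, hypothetical sketch of that general rating-based reward-learning recipe under stated assumptions: a 5-level rating scale, a simple mean-reward rating head, PyTorch, and inverse-frequency class weights as one way to counter rating imbalance. It is an illustration of the idea, not the authors' ERL-VLM implementation, and the VLM ratings are stood in for by random labels.

```python
# Hypothetical sketch of rating-based reward learning from discrete VLM ratings.
# RewardModel, RatingHead, NUM_RATINGS, and the class weights are illustrative
# assumptions, not the paper's actual architecture or hyperparameters.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_RATINGS = 5   # assumed discrete rating scale, e.g. 0 (bad) .. 4 (good)
OBS_DIM = 32      # placeholder observation dimension

class RewardModel(nn.Module):
    """Maps a single observation to a scalar reward in (0, 1)."""
    def __init__(self, obs_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.net(obs)).squeeze(-1)

class RatingHead(nn.Module):
    """Maps the mean per-step reward of a trajectory segment to rating logits."""
    def __init__(self, num_ratings: int):
        super().__init__()
        self.head = nn.Linear(1, num_ratings)

    def forward(self, mean_reward: torch.Tensor) -> torch.Tensor:
        return self.head(mean_reward.unsqueeze(-1))

def rating_loss(reward_model, rating_head, segments, ratings, class_weights=None):
    """Cross-entropy between predicted ratings and VLM-provided rating labels.

    segments:      (batch, horizon, obs_dim) trajectory segments
    ratings:       (batch,) integer rating labels returned by the VLM
    class_weights: optional per-rating weights to mitigate label imbalance
    """
    per_step = reward_model(segments)       # (batch, horizon) per-step rewards
    mean_reward = per_step.mean(dim=1)      # (batch,) segment-level summary
    logits = rating_head(mean_reward)       # (batch, num_ratings)
    return F.cross_entropy(logits, ratings, weight=class_weights)

# Toy usage with random data standing in for rollouts and VLM ratings.
if __name__ == "__main__":
    reward_model, rating_head = RewardModel(OBS_DIM), RatingHead(NUM_RATINGS)
    optim = torch.optim.Adam(
        list(reward_model.parameters()) + list(rating_head.parameters()), lr=3e-4
    )
    segments = torch.randn(16, 50, OBS_DIM)          # fake trajectory segments
    ratings = torch.randint(0, NUM_RATINGS, (16,))   # fake VLM ratings
    # Inverse-frequency weights: one simple way to handle imbalanced ratings.
    counts = torch.bincount(ratings, minlength=NUM_RATINGS).clamp(min=1).float()
    weights = counts.sum() / (NUM_RATINGS * counts)
    loss = rating_loss(reward_model, rating_head, segments, ratings, weights)
    optim.zero_grad(); loss.backward(); optim.step()
    print(f"rating loss: {loss.item():.3f}")
```

Once trained this way, the per-step reward model can be plugged into any standard RL algorithm in place of a hand-designed reward; the rating head is only used during reward learning.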
Lay Summary: Teaching robots to perform tasks with reinforcement learning often requires carefully designing a reward function, which tells the robot which behaviors are good and which are bad. Creating such a function usually takes a lot of time and human effort, making it a major challenge in real-world applications. Instead of designing rewards manually, researchers have developed methods that learn them directly from human feedback. However, collecting enough high-quality feedback still involves significant human effort, which can be costly and difficult to scale. Our work introduces a way to automate this process using AI tools like ChatGPT or Gemini. These tools evaluate the robot's behavior and provide scores, similar to how a teacher grades a student, and the robot learns from these scores and improves over time. All a human needs to provide is a simple description of the task in plain language, making it easier for anyone to help train intelligent systems without technical expertise. This framework offers a scalable and practical way to generate large amounts of feedback for teaching robots. It significantly reduces human effort in both designing reward functions and providing detailed feedback, enabling more efficient and accessible reinforcement learning in real-world scenarios.
Primary Area: Reinforcement Learning->Deep RL
Keywords: Reinforcement Learning From Human Feedback, Reinforcement Learning From AI Feedback, Rating-based Reinforcement Learning
Submission Number: 7059