TL;DR: We propose CTRL, a framework that trains LLMs to critique without human supervision, enabling them to supervise stronger models and achieve test-time scaling through iterative critique-revision.
Abstract: Teaching large language models (LLMs) to critique and refine their outputs is crucial for building systems that can iteratively improve, yet it is fundamentally limited by the model's ability to provide *accurate judgments* and *actionable suggestions*. In this work, we study LLM critics for code generation and propose $\texttt{CTRL}$, a framework for $\texttt{C}$ritic $\texttt{T}$raining via $\texttt{R}$einforcement $\texttt{L}$earning, which trains a critic model to generate feedback that maximizes correction performance for a fixed generator model without human supervision. Our results demonstrate that critics trained with $\texttt{CTRL}$ significantly enhance pass rates and mitigate compounding errors across both base and stronger generator models.
Furthermore, we show that these critic models act as accurate generative reward models and enable test-time scaling through iterative critique-revision, achieving up to 106.1\% relative improvements across challenging code generation benchmarks.
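For concreteness, one plausible formalization of the training objective described above (our own notation, not necessarily the paper's exact formulation): the critic $\pi_{\phi}$ is optimized so that the feedback it produces maximizes the expected correctness of the fixed generator's revision,
$$
\max_{\phi}\;
\mathbb{E}_{\,x \sim \mathcal{D},\; y \sim \pi_{\text{gen}}(\cdot \mid x),\; c \sim \pi_{\phi}(\cdot \mid x, y),\; y' \sim \pi_{\text{gen}}(\cdot \mid x, y, c)}
\big[\, r(x, y') \,\big],
$$
where $x$ is a coding problem, $y$ the fixed generator's initial solution, $c$ the critique, $y'$ the revised solution, and $r$ an outcome reward such as unit-test pass/fail; no human-written critiques are required anywhere in the objective.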
Lay Summary: Large language models (LLMs) struggle to effectively critique their own work, lacking the ability to provide accurate and actionable feedback needed for self-improvement, especially in areas like code generation.
We introduce CTRL, a technique that trains an LLM critic using reinforcement learning—without direct human examples—to become skilled at providing clear, corrective feedback on code.
These CTRL-trained critics significantly boost code accuracy and reduce accumulated errors, even when assisting more powerful AI models, enabling AI systems to improve more efficiently through cycles of feedback and revision.
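As a rough illustration of the test-time critique-revision loop described above, the sketch below shows one way such a cycle could be wired up; the `Generator`/`Critic` interfaces, method names, and `is_correct` check are illustrative placeholders, not the interface of the released code.

```python
from typing import Callable, Protocol


class Generator(Protocol):
    def generate(self, problem: str) -> str: ...
    def revise(self, problem: str, solution: str, feedback: str) -> str: ...


class Critic(Protocol):
    def critique(self, problem: str, solution: str) -> str: ...


def critique_revision_loop(
    problem: str,
    generator: Generator,
    critic: Critic,
    is_correct: Callable[[str, str], bool],
    max_rounds: int = 3,
) -> str:
    """Iteratively improve a solution via critic feedback (test-time scaling sketch)."""
    solution = generator.generate(problem)            # initial attempt from the fixed generator
    for _ in range(max_rounds):
        if is_correct(problem, solution):             # e.g. unit tests pass; stop early
            break
        feedback = critic.critique(problem, solution)              # critic writes feedback
        solution = generator.revise(problem, solution, feedback)   # generator revises
    return solution
```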
Link To Code: https://github.com/HKUNLP/critic-rl
Primary Area: Deep Learning->Large Language Models
Keywords: LLMs, learning to critique, generative reward models, reinforcement learning
Submission Number: 3057