Abstract: GRPO (Group Relative Policy Optimization) is a standard approach to endowing pretrained LLMs with reasoning capabilities. It estimates the advantage of an outcome from a group
of K outcomes, and promotes those with positive
advantages inside a trust region. Since GRPO
discriminates between good and bad outcomes
softly, it benefits from additional refinements such
as asymmetric clipping and zero-variance data filtering. While effective, these refinements require
significant empirical insight and can be challenging to identify. We instead propose an explicit
contrastive learning approach: rather than estimating advantages, we bifurcate the K outcomes into
positive and negative sets and maximize the likelihood of the positive outcomes. Our approach can be
viewed as an online instantiation of (multi-label)
noise contrastive estimation for LLM reasoning.
We validate our method by demonstrating competitive performance on a suite of challenging
math benchmarks against strong baselines such
as DAPO and online DPO.