Keywords: reinforcement learning, reasoning, large language models, verifiable rewards, non-verifiable rewards
TL;DR: We investigate likelihood-based rewards for LLM reasoning, applying them both to verifiable (mathematical) questions and to long-form non-verifiable domains.
Abstract: Fine-tuning large language models (LLMs) for reasoning is usually done by reinforcement learning on reasoning benchmarks, with a specific reward function, often binary, for each benchmark. Here, we systematically investigate chain-of-thought (CoT) training in LLMs using rewards derived from the probability or log-probability of emitting the reference answer (or any other prompt continuation present in the data). Several recent works have advocated for the use of similar rewards (e.g., VeriFree, JEPO, RLPR, NOVER), which have the advantage of not relying on specific verifiers and being available at scale.
We compare these probability-based variants with standard baselines, testing performance both on mathematical reasoning benchmarks and on long-form answers where no external verifier is available. We find that using the \emph{log-probability} of the reference answer as the reward for CoT learning is the only option that performs well in all setups. This choice is also consistent with the next-token log-likelihood loss used during pre-training.
In verifiable settings, log-probability rewards achieve success rates comparable to or better than those obtained with standard binary rewards, and yield much better perplexity. In non-verifiable settings, they perform on par with SFT. By contrast, methods based on probability, such as VeriFree, flatline in non-verifiable settings because the probability of emitting the correct answer vanishes.
Overall, this establishes log-probability rewards as a viable method for CoT fine-tuning, bridging the short-answer verifiable and long-answer non-verifiable settings.
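For concreteness, the sketch below shows one way a log-probability reward could be computed for a sampled chain of thought: score the reference answer tokens conditioned on the prompt and the sampled CoT. The model name, tokenization details, and the mean-versus-sum choice are illustrative assumptions, not the paper's exact recipe.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical model choice, for illustration only.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B")
model.eval()

@torch.no_grad()
def logprob_reward(prompt: str, cot: str, reference_answer: str) -> float:
    """Reward = mean log-probability of the reference-answer tokens,
    conditioned on the prompt and the sampled chain of thought."""
    context_ids = tokenizer(prompt + cot, return_tensors="pt").input_ids
    answer_ids = tokenizer(reference_answer, add_special_tokens=False,
                           return_tensors="pt").input_ids
    input_ids = torch.cat([context_ids, answer_ids], dim=1)

    logits = model(input_ids).logits                       # [1, T, vocab]
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)  # position t predicts token t+1
    targets = input_ids[:, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)

    # Keep only the positions whose target is an answer token.
    n_answer = answer_ids.shape[1]
    answer_lp = token_lp[:, -n_answer:]
    return answer_lp.mean().item()  # mean (or sum) log-prob as the scalar reward

This scalar would then replace the binary verifier signal in a standard RL fine-tuning loop; a probability-based variant would instead exponentiate the summed log-probabilities, which is where vanishing values arise for long answers.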
Primary Area: foundation or frontier models, including LLMs
Submission Number: 20983