CPC-GRPO: Answer-Free Reinforcement Learning with Cross-Prompt Consensus Rewards

ACL ARR 2026 January Submission 10228 Authors

06 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Answer-free Reinforcement Learning, Prompt Ensemble, Reinforcement Learning
Abstract: Reinforcement learning with verifiable rewards has improved reasoning in language models, but it typically relies on a ground-truth answer or an external verifier, which limits applicability and increases cost. We propose an answer-free training objective that derives rewards solely from the model’s own probabilities by exploiting prompt paraphrases as multiple semantic views of the same intent. For each paraphrase set, we generate candidate responses, rescore each response under the other paraphrased prompts via teacher forcing, and define a cross-prompt consensus reward that favors responses supported across views over those that fit only a single phrasing. We optimize this reward with a policy update that uses an all-pairs objective and broadcasts advantages across prompt–response pairs. The framework naturally supports prefix-level training, enabling a controllable cost–signal trade-off. Experiments on RobustAlpacaEval and out-of-domain reasoning benchmarks (OpenBookQA, AQuA, HumanEval) show consistent gains over pre-trained baselines for LLaMA3.2-3B and Qwen3-4B, alongside analyses demonstrating reward–performance alignment and the importance of design choices such as excluding self-view scores and ensemble-based candidate generation. All experiment code is available on our GitHub.
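To make the reward concrete, here is a minimal sketch of the cross-prompt consensus scoring the abstract describes, assuming a HuggingFace-style causal LM; the function names and interfaces are illustrative, not the paper's released code.

```python
import torch

def sequence_logprob(model, tokenizer, prompt: str, response: str) -> float:
    """Teacher-forced log-probability of `response` given `prompt`.
    Assumes tokenizing `prompt` yields a prefix of tokenizing
    `prompt + response` (typically true for BPE tokenizers when the
    response begins with whitespace)."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # logits[0, t] predicts token t+1, so drop the last position.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = full_ids[0, prompt_len:]  # response tokens only
    rows = log_probs[prompt_len - 1 : prompt_len - 1 + targets.shape[0]]
    return rows.gather(1, targets.unsqueeze(1)).sum().item()

def consensus_reward(model, tokenizer, paraphrases: list[str],
                     source_idx: int, response: str) -> float:
    """Cross-prompt consensus reward: mean teacher-forced log-prob of the
    response under every paraphrase EXCEPT the one it was sampled from
    (the self-view), so a response is rewarded only when it is supported
    across phrasings rather than fitting a single one."""
    scores = [sequence_logprob(model, tokenizer, p, response)
              for k, p in enumerate(paraphrases) if k != source_idx]
    return sum(scores) / len(scores)
```

In this sketch, excluding the self-view (`k != source_idx`) mirrors the ablation the abstract highlights: including the sampling prompt's own score would let a response that overfits one phrasing inflate its reward.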
Paper Type: Long
Research Area: Machine Learning for NLP
Research Area Keywords: reinforcement learning, optimization methods, self-supervised learning
Contribution Types: NLP engineering experiment, Approaches low compute settings-efficiency
Languages Studied: English
Submission Number: 10228