Dual Consensus: Escaping Spurious Majorities in Unsupervised RLVR via a Two-Stage Voting Mechanism

ACL ARR 2026 January Submission 3648 Authors

04 Jan 2026 (modified: 20 Mar 2026)
License: CC BY 4.0
Keywords: Reinforcement Learning; Large Language Models; RLVR; Self-Improvement; Pseudo-label Selection
Abstract: Current label-free RLVR approaches for large language models (LLMs), such as TTRL and Self-reward, have demonstrated effectiveness in improving the performance of LLMs on complex reasoning tasks. However, these methods rely heavily on accurate pseudo-label estimation and converge on spurious yet popular answers, thereby trapping in a dominant mode and limiting further improvements. Building on this, we propose \textbf{D}ual \textbf{C}onsensus \textbf{R}einforcement \textbf{L}earning (DCRL), a novel self-supervised training method which is capable of generating more reliable learning signals through a two-stage consensus mechanism. The model initially acts as an \textit{anchor}, producing dominant responses; then it serves as an \textit{explorer}, generating diverse auxiliary signals via a temporary unlearning process. The final training target is derived from the harmonic mean of these two signal sets. Notably, the process operates entirely without external models or supervision. Across eight benchmarks and diverse domins, DCRL consistently improves Pass@1 over majority vote while yielding more stable training dynamics. These results demonstrate that DCRL establishes a scalable path toward stronger reasoning without labels.
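The abstract describes combining two answer distributions (anchor and explorer) through a harmonic mean to form a pseudo-label reward. Below is a minimal sketch of that idea; since the paper body is not shown, the sampling interface, the agreement scores, and the function names (`agreement_scores`, `dual_consensus_reward`) are all hypothetical illustrations of the described mechanism, not the authors' implementation.

```python
# Sketch of a dual-consensus reward: a rollout is rewarded by the harmonic
# mean of its agreement with the anchor (dominant) and explorer (diverse)
# answer distributions. All details are assumed for illustration.
from collections import Counter
from typing import List


def agreement_scores(samples: List[str]) -> dict:
    """Fraction of sampled answers that share each distinct final answer."""
    counts = Counter(samples)
    total = len(samples)
    return {ans: c / total for ans, c in counts.items()}


def dual_consensus_reward(anchor_samples: List[str],
                          explorer_samples: List[str],
                          rollout: str) -> float:
    """Harmonic mean of the rollout's support under both vote stages."""
    p = agreement_scores(anchor_samples).get(rollout, 0.0)
    q = agreement_scores(explorer_samples).get(rollout, 0.0)
    if p == 0.0 or q == 0.0:
        return 0.0  # the harmonic mean vanishes if either consensus is absent
    return 2 * p * q / (p + q)


# Toy usage: "42" is a spurious majority in the anchor votes but receives no
# explorer support, so its reward collapses, while "17" is backed by both.
anchor = ["42", "42", "42", "17", "17"]
explorer = ["17", "17", "9", "9", "17"]
print(dual_consensus_reward(anchor, explorer, "42"))  # 0.0
print(dual_consensus_reward(anchor, explorer, "17"))  # 0.48
```

This illustrates why a harmonic mean (rather than, say, an arithmetic mean) could suppress a spurious majority: an answer must receive nonzero support from both the anchor and explorer stages to earn any reward at all.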
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: Language Modeling; Machine Learning for NLP
Contribution Types: Surveys, Theory
Languages Studied: English
Submission Number: 3648