Shrinking the Generation-Verification Gap with Weak Verifiers

Published: 11 Jun 2025 · Last Modified: 10 Jul 2025 · ES-FoMo III · CC BY 4.0
Keywords: test-time compute, repeated sampling, weak supervision, weak verification
TL;DR: We introduce Weaver, a framework that combines multiple weak verifiers to effectively select responses in repeated sampling, achieving frontier model accuracy without supervised fine-tuning, while reducing verification costs by 99.97%.
Abstract: Verifiers can improve language model (LM) capabilities by providing feedback or selecting the best response from a pool of generated candidates. Currently, high-quality verifiers are either unscalable (e.g., humans) or limited in utility (e.g., tools like Lean for formal proofs). While LM judges and reward models have become broadly useful as general-purpose verifiers, a significant performance gap remains between them and oracle verifiers. To help close this gap, we introduce Weaver, a framework for designing a strong verifier by combining multiple weak, imperfect verifiers. First, we find that weighted ensembles of verifiers, which typically require learning from labeled data, significantly outperform unweighted combinations due to differences among the verifiers. To reduce the dependency on labeled data, Weaver leverages weak supervision to estimate each verifier's accuracy and combines their outputs into a unified score that better reflects true response quality. However, directly applying weak supervision algorithms poses several challenges, including inconsistent verifier output formats and the presence of low-quality verifiers. Weaver addresses these challenges by using dataset statistics to normalize verifier outputs and to filter out low-quality verifiers. We study the effectiveness of Weaver in repeated sampling settings, where a model generates multiple candidate responses at test time and a verifier is used to select the correct one. Our evaluations demonstrate that Weaver significantly improves pass@1 performance across several reasoning and math tasks, achieving o3-mini-level accuracy with Llama 3.3 70B Instruct (a much cheaper non-reasoning model) as the generator and an ensemble of smaller judge and reward models as the verifiers (86.2% average). This gain mirrors the jump between GPT-4o and o3-mini (69.0% vs. 86.7%), which required extensive fine-tuning and post-training interventions. To make Weaver more efficient, we train a compact 400M cross-encoder on Weaver's combined output scores. This distilled model retains 98.7% of Weaver's full accuracy while reducing verification compute by up to 99.97%.
Submission Number: 12
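
As a rough illustration (not the authors' code), the sketch below follows the recipe the abstract describes: normalize each verifier's scores using dataset statistics, filter out near-random verifiers, estimate each remaining verifier's accuracy without labeled data, and combine votes into a unified score. The label-free accuracy estimation here uses a one-coin Dawid–Skene-style EM, a simple stand-in for the paper's weak-supervision label model; all names (aggregate_verifiers, min_informative) and thresholds are hypothetical.

    # Hypothetical sketch of Weaver-style verifier aggregation (not the paper's code).
    # `scores` is an (n_responses, n_verifiers) array of raw verifier scores for the
    # candidate responses to a single problem.
    import numpy as np

    def aggregate_verifiers(scores, n_iter=50, min_informative=0.55):
        # 1) Normalize each verifier's scores with dataset statistics (z-score),
        #    then binarize at that verifier's mean to get per-response votes.
        z = (scores - scores.mean(axis=0)) / (scores.std(axis=0) + 1e-8)
        votes = (z > 0).astype(float)                 # (n, m) votes in {0, 1}

        # 2) Filter verifiers that barely agree with the majority vote; they are
        #    likely near-random and add noise to the ensemble.
        majority = (votes.mean(axis=1) > 0.5).astype(float)
        agree = (votes == majority[:, None]).mean(axis=0)
        votes = votes[:, agree >= min_informative]

        # 3) Estimate each remaining verifier's accuracy with EM under a one-coin
        #    model: P(vote == latent correctness) = acc_j, with label prior p.
        n, m = votes.shape
        acc = np.full(m, 0.7)                         # init: better than chance
        p = 0.5
        for _ in range(n_iter):
            # E-step: posterior probability that each response is correct.
            ll1 = (votes * np.log(acc) + (1 - votes) * np.log(1 - acc)).sum(axis=1)
            ll0 = ((1 - votes) * np.log(acc) + votes * np.log(1 - acc)).sum(axis=1)
            q = 1.0 / (1.0 + np.exp(ll0 + np.log(1 - p) - ll1 - np.log(p)))
            # M-step: re-estimate accuracies and the prior from the soft labels.
            acc = (q[:, None] * votes + (1 - q[:, None]) * (1 - votes)).mean(axis=0)
            acc = acc.clip(0.51, 0.99)                # keep log-odds well-defined
            p = q.mean().clip(0.05, 0.95)

        # 4) Unified score: accuracy-weighted log-odds vote per response.
        w = np.log(acc / (1 - acc))                   # higher accuracy => more weight
        return votes @ w

    # Usage: select the best of k sampled responses for one problem.
    rng = np.random.default_rng(0)
    scores = rng.random((16, 5))                      # 16 candidates, 5 weak verifiers
    best = int(np.argmax(aggregate_verifiers(scores)))

Weaver's actual label model, normalization, and filtering differ in detail; the point of the sketch is that per-verifier accuracy estimates, learned without ground-truth labels, yield the weighted ensemble that the abstract reports outperforming unweighted combinations.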