Keywords: test-time scaling, reasoning, factuality, hallucinations
Abstract: Inference-time scaling methods improve language model performance, but existing methods lack the flexibility to synthesize information across multiple long-form generation samples. We introduce Consensus Graphs (ConGrs), a flexible DAG-based data structure that represents shared content and semantic variation across a set of LM responses to the same prompt. We construct ConGrs using a lightweight lexical sequence alignment algorithm from bioinformatics, supplemented by targeted use of a secondary LM judge. We then design and evaluate task-dependent decoding methods that synthesize final responses from ConGrs. Our experiments show that synthesizing responses from ConGrs improves factual precision on a biography generation task by up to 31% over an average response and reduces reliance on LM judges by more than 80% compared to other methods. On the MATH and AIME reasoning tasks, our approach improves accuracy by up to 6 points over self-verification and majority-vote baselines. ConGrs efficiently encode the variation among responses, which can then be exploited to improve downstream performance across tasks.
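The abstract names two mechanical pieces: a DAG assembled by lexically aligning sampled responses, and a consensus-driven decoding step that synthesizes a final answer. The sketch below is a simplified approximation of that idea, not the paper's implementation: it substitutes Python's stdlib difflib for the bioinformatics aligner, treats the first sample as an alignment backbone rather than merging all samples into one graph, and uses hypothetical function names (build_consensus_graph, consensus_decode).

```python
"""Illustrative sketch of a consensus-graph construction over LM samples.

Assumptions (ours, not the paper's): responses are pre-tokenized word
lists, the first sample is the alignment backbone, and divergent spans
are stored as branches off that backbone.
"""
from difflib import SequenceMatcher


def build_consensus_graph(responses):
    """Align each sample against the first; return per-token support and branches.

    support[i] -> number of samples agreeing with backbone token i
    branches   -> (anchor_before, anchor_after, tokens, sample_idx) tuples
                  recording divergent spans hung between backbone anchors
    """
    backbone = responses[0]
    support = [1] * len(backbone)  # the backbone supports itself
    branches = []
    for k, other in enumerate(responses[1:], start=1):
        sm = SequenceMatcher(a=backbone, b=other, autojunk=False)
        for tag, i1, i2, j1, j2 in sm.get_opcodes():
            if tag == "equal":  # shared span: bump consensus counts
                for i in range(i1, i2):
                    support[i] += 1
            elif tag in ("replace", "insert"):
                # divergent span: record it as a parallel branch
                branches.append((i1 - 1, i2, tuple(other[j1:j2]), k))
            # "delete": sample k skips backbone[i1:i2]; nothing to add
    return support, branches


def consensus_decode(responses, support, min_frac=0.5):
    """Keep backbone tokens supported by at least min_frac of all samples."""
    n = len(responses)
    return [tok for tok, s in zip(responses[0], support) if s / n >= min_frac]


if __name__ == "__main__":
    samples = [
        "Marie Curie won two Nobel Prizes in physics and chemistry".split(),
        "Marie Curie won two Nobel Prizes for physics and chemistry".split(),
        "Marie Curie won three Nobel Prizes in physics and chemistry".split(),
    ]
    support, branches = build_consensus_graph(samples)
    # Majority decode resolves "two" vs "three" in favor of the majority;
    # raising min_frac toward 1.0 keeps only spans all samples agree on.
    print(" ".join(consensus_decode(samples, support, min_frac=2 / 3)))
```

A real ConGr would merge every sample into a single graph rather than branching off one backbone, and per the abstract would invoke an LM judge to align spans that are semantically equivalent but lexically different; the sketch only shows how agreement counts over a shared structure can drive a majority-style decode.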
Submission Number: 132