Abstract: Large language model (LLM) reasoning can be improved by scaling test-time compute with aggregation, i.e., generating multiple samples and aggregating over them. While this strategy improves performance, it often reaches a saturation point beyond which additional compute provides no further gain. Refinement offers an alternative by using model-generated feedback to improve answer quality. However, refinement faces three key challenges: (1) Excessive refinement: Uniformly refining all instances can cause over-correction and reduce overall performance. (2) Inability to localize and address errors: LLMs struggle to identify and correct their own mistakes. (3) Insufficient refinement: Stopping refinement too soon can leave errors unaddressed. To tackle these issues, we propose MAgICoRe, a framework for Multi-Agent Iteration for Coarse-to-fine Refinement. MAgICoRe mitigates excessive refinement by categorizing problems as easy or hard, solving easy problems with coarse-grained aggregation and hard ones with fine-grained multi-agent refinement. To better localize errors, we incorporate external step-wise reward model scores, and to ensure sufficient refinement, we iteratively refine the solutions using a multi-agent setup. We evaluate MAgICoRe on Llama-3-8B and GPT-3.5 and show its effectiveness across seven reasoning datasets. One iteration of MAgICoRe beats Self-Consistency by 3.4%, Best-of-k by 3.2%, and Self-Refine by 4.0% even when these baselines use k = 120, while MAgICoRe uses less than 50% of the compute.
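The coarse-to-fine routing described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the difficulty rule (majority-vote agreement against `agree_threshold`), the `Sample` layout, the `refine` callable, and the default values are all assumptions made for illustration, since the abstract does not specify how problems are classified or how the agents are prompted.

```python
from collections import Counter
from typing import Callable, List, Tuple

# Assumed layout: (final answer, per-step reward-model scores).
Sample = Tuple[str, List[float]]


def majority_answer(samples: List[Sample]) -> Tuple[str, float]:
    """Coarse aggregation: return the most common answer and its vote share."""
    counts = Counter(answer for answer, _ in samples)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(samples)


def weakest_step(step_scores: List[float]) -> int:
    """Index of the lowest-scored step, a stand-in for error localization."""
    return min(range(len(step_scores)), key=step_scores.__getitem__)


def solve(
    samples: List[Sample],
    refine: Callable[[Sample], Sample],  # hypothetical refinement agent
    agree_threshold: float = 0.6,        # assumed easy/hard cutoff
    max_iters: int = 3,                  # assumed iteration budget
) -> str:
    """Aggregate easy problems; iteratively refine hard ones, then re-aggregate."""
    answer, agreement = majority_answer(samples)
    for _ in range(max_iters):
        if agreement >= agree_threshold:
            break  # treated as easy (or sufficiently refined): stop here
        samples = [refine(s) for s in samples]
        answer, agreement = majority_answer(samples)
    return answer
```

In this sketch, a problem whose samples already agree is answered by aggregation alone and `refine` is never called, which mirrors the goal of avoiding excessive refinement. A real `refine` would presumably use the step-wise scores (for example via `weakest_step`) to target feedback at the least reliable step.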
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: LLM reasoning, LLM refinement, Multi-agent refinement
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 4547