Review, Revise, and Learn: Peer-Guided Preference Learning via LLM Self-Correction

Minsang Kim; Seung Jun Baek

18 Sept 2025 (modified: 26 Jan 2026), ICLR 2026 Conference Withdrawn Submission, CC BY 4.0

Keywords: Large Language Models, LLM Preference Optimization, Multi-LLM Agents, LLM Alignment

TL;DR: PULSE is a framework where multiple LLMs use a collaborative peer-review process to critique and refine each other's responses, autonomously generating preference data to train better models without human supervision.

Abstract: Preference optimization plays a central role in achieving state-of-the-art performance in large language models (LLMs). However, preference learning requires large-scale, high-quality, or human-annotated datasets, which poses a significant challenge to the continual improvement of LLMs. We introduce PULSE (Peer-gUided Preference Learning via LLM SElf-correction), a collaborative framework of multiple LLM agents for scalable preference learning. PULSE is inspired by the academic peer-review process: an actor LLM first generates an initial response to a query, and critic peer LLMs evaluate and provide feedback on the response. The actor revises or corrects its response based on this feedback, and the critics finally assign scores to the revised response. The scores of the initial and revised outputs are used as preference scores to construct preference data. This process enables autonomous and collective reasoning of LLMs for constructing preference data without human supervision. However, preference data constructed by LLMs may be subject to noise or reward hacking. To mitigate the issue, we first provide a unified view on robust preference learning through the lens of risk minimization, and then propose a framework for robust training on self-correction datasets. Experiments show that PULSE significantly outperforms existing approaches, achieving performance gains up to 47.3% and 34.6% on Alpaca LC and Alpaca 2.0, and 23.9%, 102.8%, and 12.4% on a collection of math, coding, and general reasoning tasks, demonstrating its potential to create and sustain scalable LLM ecosystems.
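
The review-revise-score loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Critic` interface, the `generate` and `revise` callables, the use of a mean to aggregate critic scores, and the `min_margin` tie filter are all assumptions made here for illustration, since the abstract does not specify them. The robust training step is not shown.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List, Optional, Sequence


@dataclass
class Critic:
    # Hypothetical interface: a critic writes feedback and assigns a score.
    review: Callable[[str, str], str]   # (query, response) -> feedback text
    score: Callable[[str, str], float]  # (query, response) -> scalar score


@dataclass
class PreferencePair:
    query: str
    chosen: str
    rejected: str
    margin: float  # absolute gap between the aggregated scores


def build_preference_pair(
    query: str,
    generate: Callable[[str], str],                  # actor: first draft
    revise: Callable[[str, str, List[str]], str],    # actor: revision from feedback
    critics: Sequence[Critic],
    min_margin: float = 0.0,                         # assumed filter, not from the paper
) -> Optional[PreferencePair]:
    """Run one review-revise round and return a preference pair, or None on a tie."""
    # 1. The actor produces an initial response.
    initial = generate(query)
    # 2. Each critic reviews the initial response.
    feedback = [c.review(query, initial) for c in critics]
    # 3. The actor revises its response using the collected feedback.
    revised = revise(query, initial, feedback)
    # 4. Critics score both versions; the mean is an assumed aggregation rule.
    s_init = mean(c.score(query, initial) for c in critics)
    s_rev = mean(c.score(query, revised) for c in critics)
    # 5. The higher-scored response becomes "chosen"; near-ties are dropped.
    margin = abs(s_rev - s_init)
    if margin <= min_margin:
        return None
    if s_rev > s_init:
        return PreferencePair(query, revised, initial, margin)
    return PreferencePair(query, initial, revised, margin)
```

In practice the callables would wrap LLM calls. Keeping them as plain functions lets the data-construction logic be tested with stubs, and the stored `margin` could serve as a confidence signal when handling noisy pairs, although how PULSE uses the scores during training is not stated in the abstract.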

Supplementary Material: zip

Primary Area: reinforcement learning

Submission Number: 11610
