Overton Pluralistic Reinforcement Learning for Large Language Models

ICLR 2026 Conference Submission 13015 Authors

18 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: Pluralistic Alignment, Overton Pluralism, Reinforcement Learning From Human Feedback, Large Language Models
TL;DR: We propose OP-GRPO, which uses a Sentence Transformer to evaluate the perspective coverage of LLM responses and fine-tunes LLM policies accordingly. OP-GRPO achieves promising performance compared to previous baselines while producing concise responses.
Abstract: Existing alignment paradigms remain limited in capturing the pluralistic nature of human values. Overton Pluralism (OP) addresses this gap by generating responses that present diverse perspectives on a single query. This paper introduces OP-GRPO (Overton Pluralistic Group Relative Policy Optimization), a reinforcement learning framework for implicit Overton Pluralism that enables a single LLM to produce pluralistic responses without explicit prompting or modular orchestration. Our workflow consists of two main steps: 1) similarity estimator training, which fine-tunes a Sentence Transformer for OP tasks to provide more accurate coverage evaluation of the given responses; and 2) OP-GRPO training, which incorporates this similarity estimator into a carefully designed dual-reward system to ensure both broad coverage of genuine human perspectives and the uniqueness of each perspective, thereby promoting diversity. Empirical results demonstrate that OP-GRPO achieves a "Small models, Big perspective coverage" effect: our trained Qwen2.5-3B-Instruct surpasses the GPT-OSS (20B) baseline with a 37.4% relative accuracy gain on the Natural Language Inference (NLI) benchmark. It also outperforms a modular-architecture baseline with a 19.1% relative improvement. Evaluations with GPT-4.1 as an LLM judge for response quality assessment further confirm the robustness of our approach.
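The full paper is not reproduced here, so the following is only a minimal, hypothetical sketch of the kind of coverage-plus-uniqueness dual reward the abstract describes, using the sentence-transformers library as the similarity estimator. The function name, base model, and reward weighting are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: a dual reward combining perspective coverage and uniqueness,
# scored with an off-the-shelf Sentence Transformer as the similarity estimator.
# Names, weights, and the base model are assumptions for illustration only.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in for a fine-tuned OP estimator

def dual_reward(response_perspectives, reference_perspectives, uniqueness_weight=0.5):
    """Reward = coverage of reference perspectives minus redundancy among response perspectives."""
    resp_emb = model.encode(response_perspectives, convert_to_tensor=True)
    ref_emb = model.encode(reference_perspectives, convert_to_tensor=True)

    # Coverage: each reference perspective should be matched by some response perspective.
    sim_ref_resp = util.cos_sim(ref_emb, resp_emb)            # [n_ref, n_resp]
    coverage = sim_ref_resp.max(dim=1).values.mean().item()   # average best-match similarity

    # Uniqueness: penalize pairwise similarity among the response's own perspectives.
    if len(response_perspectives) > 1:
        sim_resp = util.cos_sim(resp_emb, resp_emb)
        n = sim_resp.shape[0]
        redundancy = ((sim_resp.sum() - sim_resp.diagonal().sum()) / (n * (n - 1))).item()
    else:
        redundancy = 0.0

    return coverage - uniqueness_weight * redundancy
```

In an OP-GRPO-style setup, a scalar reward of this form would presumably be computed per sampled response and fed into GRPO's group-relative advantage estimation during policy optimization.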
Primary Area: foundation or frontier models, including LLMs
Submission Number: 13015