Overton Pluralistic Reinforcement Learning for Large Language Models

ICLR 2026 Conference Submission 13015 Authors

18 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: Pluralistic Alignment, Overton Pluralism, Reinforcement Learning From Human Feedback, Large Language Models
TL;DR: We propose OP-GRPO, which uses a Sentence Transformer to evaluate the perspective coverage of LLM responses and fine-tunes LLM policies accordingly. OP-GRPO achieves promising performance compared to previous baselines while producing concise responses.
Abstract: Existing alignment paradigms remain limited in capturing the pluralistic nature of human values. Overton Pluralism (OP) addresses this gap by generating responses that present diverse perspectives on a single query. This paper introduces OP-GRPO (Overton Pluralistic Group Relative Policy Optimization), a reinforcement learning framework for implicit Overton Pluralism that enables a single LLM to produce pluralistic responses without explicit prompting or modular orchestration. Our workflow consists of two main steps: 1) similarity estimator training, which fine-tunes a Sentence Transformer for OP tasks to provide more accurate coverage evaluation of the given responses; and 2) OP-GRPO training, which incorporates this similarity estimator into a carefully designed dual-reward system to ensure both broad coverage of genuine human perspectives and the uniqueness of each perspective, thereby promoting diversity. Empirical results demonstrate that OP-GRPO achieves a "Small models, Big perspective coverage" effect: our trained Qwen2.5-3B-Instruct surpasses the GPT-OSS (20B) baseline with a 37.4% relative accuracy gain on the Natural Language Inference (NLI) benchmark. It also outperforms a modular-architecture baseline with a 19.1% relative improvement. Evaluations with GPT-4.1 as an LLM judge for response quality assessment further confirm the robustness of our approach.
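The full paper is not reproduced here, so the following is only a minimal, hypothetical sketch of the kind of coverage-plus-uniqueness dual reward the abstract describes, using the sentence-transformers library as the similarity estimator. The function name, base model, and reward weighting are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: a dual reward combining perspective coverage and uniqueness,
# scored with an off-the-shelf Sentence Transformer as the similarity estimator.
# Names, weights, and the base model are assumptions for illustration only.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in for a fine-tuned OP estimator

def dual_reward(response_perspectives, reference_perspectives, uniqueness_weight=0.5):
    """Reward = coverage of reference perspectives minus redundancy among response perspectives."""
    resp_emb = model.encode(response_perspectives, convert_to_tensor=True)
    ref_emb = model.encode(reference_perspectives, convert_to_tensor=True)

    # Coverage: each reference perspective should be matched by some response perspective.
    sim_ref_resp = util.cos_sim(ref_emb, resp_emb)            # [n_ref, n_resp]
    coverage = sim_ref_resp.max(dim=1).values.mean().item()   # average best-match similarity

    # Uniqueness: penalize pairwise similarity among the response's own perspectives.
    if len(response_perspectives) > 1:
        sim_resp = util.cos_sim(resp_emb, resp_emb)
        n = sim_resp.shape[0]
        redundancy = ((sim_resp.sum() - sim_resp.diagonal().sum()) / (n * (n - 1))).item()
    else:
        redundancy = 0.0

    return coverage - uniqueness_weight * redundancy
```

In an OP-GRPO-style setup, a scalar reward of this form would presumably be computed per sampled response and fed into GRPO's group-relative advantage estimation during policy optimization.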
Primary Area: foundation or frontier models, including LLMs
Submission Number: 13015