Group Pattern Selection Optimal: Let LRMs Pick the Right Pattern for Reasoning

ICLR 2026 Conference Submission 11548 Authors

18 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: LLM reasoning
TL;DR: We propose GPSO, a training paradigm that optimizes reasoning pattern selection and improves LLM reasoning accuracy
Abstract: Large reasoning models (LRMs) exhibit diverse high-level reasoning patterns (e.g., direct solution, reflection-and-verification, and exploring multiple solutions), yet prevailing training recipes implicitly bias models toward a few dominant patterns. We systematically analyze these patterns and observe substantial accuracy variance across patterns on math and science benchmarks, implying that the pattern a model adopts is often sub-optimal for a given problem. We introduce Group Pattern Selection Optimal (GPSO), a reinforcement-learning framework that extends GRPO with multi-pattern rollout, optimal pattern selection per problem via verifier signals, and attention masking to prevent leakage from explicit pattern suffixes into the learned policy. By intelligently exploring a portfolio of diverse patterns and optimizing the model's policy on the most effective ones, GPSO learns the intrinsic mapping from problem to pattern. Extensive experiments demonstrate that GPSO delivers consistent and substantial performance improvements across a wide range of model backbones and reasoning benchmarks, effectively mitigating the sub-optimality issue and enabling more robust and adaptable LLM reasoning.
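The abstract describes a per-problem loop: roll out each candidate reasoning pattern several times, score the rollouts with a verifier, and train on the best-performing pattern's group while masking the explicit pattern suffix. A minimal sketch of that selection step is below; all names (`PATTERNS`, `generate`, `verify`, `gpso_select`) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of GPSO's per-problem pattern selection step.
# `generate` and `verify` stand in for the model's rollout function and
# the verifier signal described in the abstract; they are assumptions.

PATTERNS = ["direct_solution", "reflect_and_verify", "explore_multiple"]

def gpso_select(problem, generate, verify, rollouts_per_pattern=4):
    """Roll out each reasoning pattern on one problem, score the rollouts
    with a verifier, and return the best pattern and its rollout group."""
    best_pattern, best_acc, best_group = None, -1.0, []
    for pattern in PATTERNS:
        # Multi-pattern rollout: condition generation on an explicit pattern.
        group = [generate(problem, pattern) for _ in range(rollouts_per_pattern)]
        # Verifier signal: fraction of rollouts judged correct.
        acc = sum(verify(problem, ans) for ans in group) / rollouts_per_pattern
        if acc > best_acc:
            best_pattern, best_acc, best_group = pattern, acc, group
    # The selected group would feed a GRPO-style policy update; the explicit
    # pattern-suffix tokens would be attention-masked so the policy learns
    # the problem-to-pattern mapping rather than copying the suffix.
    return best_pattern, best_group
```

The GRPO update itself and the attention masking are omitted here; the sketch only illustrates how verifier accuracy could arbitrate among pattern groups.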
Primary Area: reinforcement learning
Submission Number: 11548