Keywords: Large Language Model, Reasoning, Diversity
Abstract: Large reasoning models (LRMs) have attracted increasing attention for their ability to solve complex mathematical problems by generating extended reasoning chains. In this work, we highlight a critical yet underexplored aspect of their reasoning process: thinking schemata, which we define as the distinct transitions between reasoning steps and the variety of solution paths the model produces. We observe a correlation between the diversity of thinking schemata and model performance, which motivates us to increase that diversity as a means of improving reasoning potential and generalization. To this end, we propose Diverse Schemata Policy Optimization (DiScO), which elicits diverse thinking schemata in two stages: it first makes the model aware of the thinking schemata in its own reasoning chains, then encourages their diversity through reinforcement learning. Experiments on multiple mathematical reasoning benchmarks show that DiScO consistently outperforms standard group relative policy optimization (GRPO), with particularly pronounced gains on challenging datasets such as AIME, where our 7B and 32B DiScO models surpass closed-source frontier LRMs by 15%-30%. Overall, our work highlights the important role that diversity of the reasoning procedure plays and points to scaling along the diversity dimension as a promising research direction.
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 1236