Abstract: We introduce a new Reflective Generative Model (RGM), which achieves performance comparable to OpenAI o3-mini via a novel Reflective Generative Form. This form focuses on selecting high-quality reasoning trajectories and contains two novelties: 1) $\textbf{A unified interface for the policy and process reward models}$: we share the backbone network and use task-specific heads for reasoning trajectory prediction and scoring, respectively, introducing only 53M extra parameters for trajectory scoring. 2) $\textbf{Eliminating the reliance on process-level annotation}$: we provide a self-supervised process reward model, which learns to select high-quality reasoning trajectories directly from the outcome reward. Equipped with the Reflective Generative Form, RGM is naturally suited to test-time scaling, and we provide three reasoning effort modes (low, medium, and high) based on a controllable thinking length. Experiments demonstrate that RGM achieves performance comparable to the OpenAI o3-mini series with only 32B parameters. Code will be made available.
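The trajectory selection described in the abstract can be illustrated with a minimal sketch. It assumes that each candidate trajectory already carries per-step scores in (0, 1] from a scoring head, and that those scores are aggregated with a geometric mean. The aggregation rule, the `Trajectory` type, and the names `trajectory_score` and `select_trajectory` are hypothetical choices made for illustration and are not taken from the paper.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical representation: (answer text, per-step scores in (0, 1]).
Trajectory = Tuple[str, Sequence[float]]


def trajectory_score(step_scores: Sequence[float]) -> float:
    """Aggregate per-step scores with a geometric mean (an assumed choice)."""
    if not step_scores:
        return 0.0
    # Clamp to avoid log(0); work in log space for numerical stability.
    log_sum = sum(math.log(max(s, 1e-12)) for s in step_scores)
    return math.exp(log_sum / len(step_scores))


def select_trajectory(candidates: List[Trajectory]) -> Trajectory:
    """Return the candidate whose aggregated score is highest."""
    if not candidates:
        raise ValueError("need at least one candidate trajectory")
    return max(candidates, key=lambda c: trajectory_score(c[1]))


# Toy candidates with made-up scores, for illustration only.
candidates = [
    ("answer A", [0.9, 0.2, 0.8]),
    ("answer B", [0.7, 0.7, 0.7]),
    ("answer C", [0.95, 0.9, 0.1]),
]
best = select_trajectory(candidates)
```

With a geometric mean, a single low-scoring step pulls down the whole trajectory, so the uniformly solid candidate is preferred over candidates that contain one weak step. Sampling more candidates before selection is one way such a scorer could be used for test-time scaling.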
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: scaling, reasoning, process reward model
Contribution Types: NLP engineering experiment
Languages Studied: English
Reassignment Request Area Chair: This is not a resubmission
Reassignment Request Reviewers: This is not a resubmission
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: Yes
A2 Elaboration: Section 7
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: Sections 5.1 and 5.2
B2 Discuss The License For Artifacts: Yes
B2 Elaboration: Appendix A.4
B3 Artifact Use Consistent With Intended Use: N/A
B4 Data Contains Personally Identifying Info Or Offensive Content: N/A
B5 Documentation Of Artifacts: Yes
B5 Elaboration: Section 5.1
B6 Statistics For Data: Yes
B6 Elaboration: Section 5.1
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: Section 5.1
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: Section 5.1
C3 Descriptive Statistics: Yes
C3 Elaboration: Section 5.1
C4 Parameters For Packages: Yes
C4 Elaboration: Section 5.1
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E Ai Assistants In Research Or Writing: No
E1 Information About Use Of Ai Assistants: N/A
Author Submission Checklist: yes
Submission Number: 1217