Revision History for Response by Authors

Official Comment Edit by Authors

  • 21 Nov 2024, 17:05 Coordinated Universal Time
  • Title: Response by Authors
  • Comment:

    We thank the reviewer for the valuable feedback and interest in our work. We address the points raised below:

    1. Experiment Setup for Reproducibility

    "More details on the experimental setup could be provided for reproducibility, including the reward thresholds for the Clip and Delta mechanisms and hyperparameters for PPO."

    • We have added the hyperparameters for PPO in Appendix D.
    • In our experiments, we set $\eta$ to be the average value of the PRM rewards over all reasoning steps associated with one question in a training batch; a minimal sketch of this computation is provided after the reference below. We implement the training pipeline on top of ReaLHF [1], adding the success reward and the dense rewards provided by the PRM.

    [1] Mei, Zhiyu, et al. "ReaLHF: Optimized RLHF Training for Large Language Models through Parameter Reallocation." arXiv preprint arXiv:2406.14088 (2024).
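
    To make the threshold computation concrete, the following is a minimal, illustrative sketch of how $\eta$ can be computed as the per-question average of the PRM step rewards within a training batch. The function name `compute_eta` and the tensor layout (a flat vector of step rewards plus a per-step question index) are our own illustration rather than the exact code of our pipeline.

    ```python
    import torch

    def compute_eta(prm_rewards: torch.Tensor, question_ids: torch.Tensor) -> torch.Tensor:
        """Illustrative sketch: eta is the mean PRM reward over all reasoning
        steps that belong to the same question within one training batch.

        prm_rewards  : float tensor [num_steps], per-step PRM rewards in the batch
        question_ids : long tensor  [num_steps], question index of each step
        Returns a tensor [num_steps] with each step's question-level threshold.
        """
        num_questions = int(question_ids.max().item()) + 1
        sums = torch.zeros(num_questions).scatter_add_(0, question_ids, prm_rewards)
        counts = torch.zeros(num_questions).scatter_add_(0, question_ids, torch.ones_like(prm_rewards))
        eta_per_question = sums / counts.clamp(min=1.0)
        # Broadcast each question's threshold back to its own steps.
        return eta_per_question[question_ids]
    ```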

    2. Experiments on Larger LLMs

    "Smaller LLMs may offer a larger scope of improvement, and so the proposed methods may seem to have been successful. However, to confirm the advantage, experiments on larger LLMs may be necessary; e.g., the paper reports GPT-4o-2024-08-06's performance to be 92.9 on GSM8K, which is higher than almost all other models and variations, and offers less scope of improvement."

    • Our main focus is to investigate how to unleash the potential of PRMs in RL training for LLM reasoning. Our main experiments over a diverse set of LLMs in Sec. 4 compare RL training with success rewards only against RL training that combines PR-Clip-Delta with success rewards. We believe this evaluation provides solid empirical justification for the effectiveness of the proposed methods.
    • If a more powerful PRM or additional training data becomes available, we believe the proposed approaches could also benefit stronger and larger models. However, these directions are beyond the scope of this work, and we leave them for future work.

    3. Computational Overhead

    "Any discussion on the computational overhead or ease of integration of Clip and Delta into existing workflows would be beneficial."

    • Both the Clip and the Delta mechanisms are straightforward to integrate into existing workflows.
    • The Clip mechanism only requires computing the mean of the PRM rewards as a threshold after reward calculation, followed by applying the formula specified in Eq. 5. This additional step is computationally lightweight and fits seamlessly into the existing reward-processing pipeline.
    • The Delta mechanism only requires computing the difference between the rewards of two adjacent steps, which is both conceptually simple and computationally efficient. As such, neither mechanism introduces significant overhead, and both are easy to adopt; a minimal sketch of the two mechanisms is given after this list.
    • We have also added related discussion in Appendix D.
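
    For concreteness, below is a minimal sketch of both mechanisms applied to the per-step PRM rewards of a single response. The exact Clip formula is Eq. 5 in the paper; here we assume, purely for illustration, that Clip upper-bounds each step reward by the threshold $\eta$ and that Delta replaces each step reward with the difference between adjacent step rewards.

    ```python
    import torch

    def clip_rewards(step_rewards: torch.Tensor, eta: torch.Tensor) -> torch.Tensor:
        """Illustrative Clip: upper-bound each step's PRM reward at the threshold
        eta (assumed form; the exact expression is Eq. 5 in the paper)."""
        return torch.minimum(step_rewards, eta)

    def delta_rewards(step_rewards: torch.Tensor) -> torch.Tensor:
        """Illustrative Delta: difference between the rewards of adjacent steps,
        assumed here as r_t - r_{t+1}; the final step keeps its own reward."""
        shifted = torch.cat([step_rewards[1:], torch.zeros(1)])
        return step_rewards - shifted

    # Hypothetical usage for one response (one question), assuming Clip is
    # applied before Delta as one possible composition of the two mechanisms.
    rewards = torch.tensor([0.8, 0.9, 0.4, 0.7])
    eta = rewards.mean()  # average PRM reward of the steps for this question
    processed = delta_rewards(clip_rewards(rewards, eta))
    ```

    Both functions are simple element-wise operations over the already-computed PRM rewards, which is why the added cost is negligible compared to PRM inference and the PPO updates themselves.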

    4. Extension to Other Tasks

    "Although mathematical reasoning is a good testbed, any comments on the potential applicability of these techniques in non-mathematical or multi-modal reasoning tasks?"

    We thank the reviewer for raising the important question of broader applicability. While our work focuses on mathematical reasoning, the proposed techniques are not inherently limited to this domain. The approach can indeed be extended to other reasoning tasks, such as coding challenges or multi-modal reasoning, as long as two key conditions are met: (1) a task-specific partitioning of reasoning steps is feasible, and (2) a reliable success signal is available. We consider these applications promising directions for future exploration.

    We hope our response addresses your concerns, and we welcome any further questions or suggestions.

  • Note – Signatures: ICLR.cc/2025/Conference/Submission11435/Authors
  • Note – Readers: everyone
  • Note – Writers: ICLR.cc/2025/Conference, ICLR.cc/2025/Conference/Submission11435/Authors
  • Note – Replyto: o86yAUzspP

