Keywords: Large Language Models, Multi-turn RL, Self Reflection, Reasoning
Abstract: Multi-turn problem solving is critical yet challenging for Large Reasoning Models (LRMs), which must reflect on their reasoning and revise it in response to feedback. Existing Reinforcement Learning with Verifiable Reward (RLVR) methods train LRMs under a single-turn paradigm. However, we observe that models trained with existing RL paradigms often fail to explore alternative reasoning paths across multiple turns and lack the capacity for self-reflection, resulting in repetitive responses that do not adapt to contextual feedback. We ask: can LRMs learn to reflect on their answers in a multi-turn context? In this work, we find that training models with multi-turn RL using only unary feedback (for example, “Let’s try again”) after wrong answers can improve both single-turn performance and multi-turn reasoning. We introduce Unary Feedback as Observation (UFO) for reinforcement learning, which uses minimal yet common unary user feedback during iterative problem solving and can be easily applied to existing single-turn RL training setups. Experimental results show that RL training with UFO preserves single-turn performance and improves multi-turn reasoning accuracy by up to 14%, enabling models to reflect on prior failures and refine their reasoning accordingly. To further minimize the number of turns needed to reach a correct answer while encouraging diverse reasoning when mistakes occur, we design reward structures that guide models to produce careful and deliberate answers in each turn.
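To illustrate the rollout structure the abstract describes, here is a minimal sketch (not the authors' code) of how unary feedback could be appended as an observation during multi-turn RL. The callables `generate` and `is_correct` are hypothetical stand-ins for the policy model and the verifiable-reward checker, and the flat 1.0/0.0 rewards are a simplification of the paper's reward shaping.

```python
# Minimal sketch of Unary Feedback as Observation (UFO) in a multi-turn rollout:
# after each wrong answer, a fixed unary message ("Let's try again.") is appended
# to the context as the only feedback the model receives before its next attempt.
from typing import Callable, Dict, List, Tuple

UNARY_FEEDBACK = "Let's try again."

def ufo_rollout(
    question: str,
    generate: Callable[[List[Dict[str, str]]], str],  # hypothetical policy: chat history -> answer
    is_correct: Callable[[str], bool],                 # hypothetical verifier: answer -> pass/fail
    max_turns: int = 5,
) -> Tuple[List[Dict[str, str]], List[float]]:
    """Roll out up to `max_turns` attempts, giving only unary feedback on failure.

    Returns the full chat history (the RL trajectory) and per-turn rewards.
    """
    history = [{"role": "user", "content": question}]
    rewards: List[float] = []
    for _ in range(max_turns):
        answer = generate(history)
        history.append({"role": "assistant", "content": answer})
        if is_correct(answer):
            rewards.append(1.0)   # verifiable reward on success ends the episode
            break
        rewards.append(0.0)       # no reward signal; only the unary observation is added
        history.append({"role": "user", "content": UNARY_FEEDBACK})
    return history, rewards
```

The paper's reward structures for discouraging extra turns and repeated answers would replace the flat per-turn rewards above; the sketch only shows the observation mechanism, which is why it drops into an existing single-turn RLVR pipeline with little change.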
Submission Type: Research Paper (4-9 Pages)
NeurIPS Resubmit Bundle: pdf
NeurIPS Resubmit Summary: Reviewers found the idea simple and practical: adding unary textual feedback (“Let’s try again”) during multi-turn RL yields solid multi-turn gains (≈14% Succ@5) without hurting single-turn performance. Main concerns were limited scope (one model and dataset, PPO only), insufficient baselines (e.g., other multi-turn methods or heuristic feedback), and the need to clarify how feedback-as-observation differs from maximum-entropy RL or long chain-of-thought (CoT). In the rebuttal we added broad experiments across multiple backbones and sizes (Qwen 1.5B/3B/7B; Llama 1B/3B), tasks beyond math (e.g., TQA, GPQA, HotPotQA, MMLU/Pro), and GRPO; we also included heuristic and complex-prompt baselines and showed that feedback yields consistent 4–8% gains over a no-feedback multi-turn baseline. We further provided a formal analysis showing that sequential (feedback-aware) policies dominate parallel sampling, and distinguished our repetition penalty and trajectory-level credit assignment from single-turn maximum-entropy RL.
NeurIPS Resubmit Attestation: I am an author of the referenced NeurIPS 2025 submission. I have the right to share the anonymous reviews/meta-review for the exclusive use of the workshop PCs/reviewers. I understand they will not be redistributed publicly.
Submission Number: 151