Keywords: Reinforcement Learning, Large Reasoning Model
Abstract: Large Language Models (LLMs) are increasingly deployed as agents that solve problems through multi-turn interaction, refining their reasoning in response to user feedback. However, existing reinforcement learning with verifiable reward (RLVR) methods train these models under a single-turn paradigm. As a result, we find that models often fail to explore alternative reasoning paths or reflect on prior mistakes, producing repetitive responses that do not adapt to feedback.
To address this gap, we propose Unary Feedback as Observation (UFO), a framework that conditions policy updates on minimal unary feedback (e.g., “Let’s try again”) after incorrect answers. UFO is simple, compatible with existing single-turn RL setups, and incentivizes self-reflection. To further promote efficient and adaptive reasoning, we design reward structures that encourage \emph{minimality} (solving in fewer turns) and \emph{diversity} (exploring alternatives under failure). Experiments show that UFO preserves single-turn performance while improving multi-turn reasoning accuracy by about 14\%. Crucially, UFO-trained models also generalize beyond their training domain, transferring effectively to out-of-domain tasks across mathematics, STEM, QA, and general knowledge, showing that UFO teaches models self-reflective reasoning that carries over across domains. Beyond these empirical gains, UFO points toward a broader paradigm for building adaptive reasoning agents: one that scales supervision from static datasets, reduces dependence on costly domain-specific feedback, and lays the foundation for more general, self-improving AI systems in open-ended real-world settings.
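To make the setup concrete, here is a minimal sketch of how a UFO-style multi-turn rollout with minimality and diversity shaping could look; the `policy.generate` / `verify` interfaces, turn limit, and reward constants are illustrative assumptions rather than the paper's actual implementation.

```python
# Hypothetical sketch of a UFO-style rollout: after each wrong answer the model
# only sees unary feedback ("Let's try again."), and the scalar reward favors
# solving in fewer turns (minimality) and penalizes repeated attempts (diversity).
# `policy`, `verify`, and the reward constants are assumed interfaces, not the
# paper's exact implementation.

def ufo_rollout(policy, question, verify, max_turns=5,
                turn_penalty=0.1, repeat_penalty=0.2):
    """Roll out up to `max_turns` attempts on one question."""
    history = [{"role": "user", "content": question}]
    seen_answers = set()
    reward = 0.0

    for turn in range(max_turns):
        answer = policy.generate(history)          # single-turn-style generation
        history.append({"role": "assistant", "content": answer})

        if verify(question, answer):               # verifiable reward (RLVR)
            # Minimality: earlier success earns a larger reward.
            reward = 1.0 - turn_penalty * turn
            break

        # Diversity: discourage repeating a previously tried (wrong) answer.
        if answer in seen_answers:
            reward -= repeat_penalty
        seen_answers.add(answer)

        # Unary feedback: no hint about *why* the answer was wrong.
        history.append({"role": "user", "content": "Let's try again."})

    return history, reward
```

Because the trajectory is just a chat history plus a scalar reward, such rollouts can be fed to a standard single-turn RL trainer with no architectural changes, which is the compatibility the abstract claims.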
Submission Number: 28