Unary Feedback as Observation: Incentivizing Self-Reflection in Large Language Models via Multi-Turn RL

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Large Language Models, Multi-turn RL, Self Reflection, Reasoning
Abstract: Large Language Models (LLMs) are increasingly deployed as agents that solve problems through multi-turn interaction, refining their reasoning in response to user feedback. However, existing reinforcement learning with verifiable rewards (RLVR) methods train them under a single-turn paradigm. As a result, we find that models often **fail to explore alternative reasoning paths or reflect on prior mistakes, producing repetitive responses that do not adapt to feedback.** To address this gap, we propose Unary Feedback as Observation (UFO), a framework that conditions policy updates on minimal unary feedback (e.g., “Let’s try again”) after incorrect answers. UFO is simple, compatible with existing single-turn RL setups, and incentivizes self-reflection. To further promote efficient and adaptive reasoning, we design reward structures that encourage _minimality_ (solving in fewer turns) and _diversity_ (exploring alternatives under failure). Experiments show that UFO preserves single-turn performance while improving multi-turn reasoning accuracy by about 14%. Crucially, UFO-trained models also **generalize beyond their training domain, transferring effectively to out-of-domain tasks** across mathematics, STEM, QA, and general knowledge, showing that **UFO teaches self-reflective reasoning that carries over across domains**. Beyond these empirical gains, UFO points toward a broader paradigm for building adaptive reasoning agents: one that scales supervision from static datasets, reduces dependence on costly domain-specific feedback, and lays the foundation for more general, self-improving AI systems in open-ended real-world settings.
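The interaction loop the abstract describes can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: the function names (`solve_fn`, `check_fn`), the `1/turn` minimality reward, and the `repeat_penalty` diversity term are all illustrative assumptions about how "fewer turns" and "exploring alternatives under failure" might be scored.

```python
# Illustrative sketch of a UFO-style multi-turn episode (names and
# reward constants are assumptions, not taken from the paper).
UNARY_FEEDBACK = "Let's try again."


def ufo_rollout(solve_fn, check_fn, max_turns=5, repeat_penalty=0.1):
    """Run one multi-turn episode with unary feedback as the only observation.

    solve_fn(history) -> answer: the policy, conditioned on prior feedback.
    check_fn(answer) -> bool:    the verifiable-reward check.
    Returns (answers, reward).
    """
    history = []   # the context grows only by the unary feedback token
    answers = []
    penalty = 0.0
    for turn in range(1, max_turns + 1):
        answer = solve_fn(history)
        if check_fn(answer):
            answers.append(answer)
            # Minimality: solving on an earlier turn earns a larger reward.
            return answers, max(1.0 / turn - penalty, 0.0)
        # Diversity: repeating an already-tried wrong answer is penalized.
        if answer in answers:
            penalty += repeat_penalty
        answers.append(answer)
        history.append(UNARY_FEEDBACK)  # unary feedback after each failure
    return answers, -penalty
```

Under this toy reward, a policy that answers correctly on the first turn receives the full reward of 1.0, while one that repeats a wrong answer before succeeding is penalized both for the extra turns and for the repetition.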
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 12212