Keywords: Reasoning Language Models, Offline RL, Self-Supervised Learning
TL;DR: We propose a simple, scalable offline RL method that leverages self-generated majority-vote signals to improve LLM reasoning performance.
Abstract: Reinforcement learning (RL) is central to improving reasoning in large language models (LLMs) but typically requires ground-truth rewards. Test-Time Reinforcement Learning (TTRL) removes this need by using majority-vote rewards, but it relies on heavyweight online RL and thus incurs substantial computational cost. We propose RoiRL: Reasoning with offline iterative Reinforcement Learning, a family of lightweight offline learning alternatives that can target the same regularized optimal policies as TTRL. Unlike TTRL, RoiRL eliminates the need to maintain a reference model and instead optimizes weighted log-likelihood objectives, enabling stable training with significantly lower memory and compute requirements. Experimental results show that RoiRL trains faster and consistently outperforms TTRL on reasoning benchmarks, establishing a scalable path to self-improving LLMs without labels.
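To make the idea concrete, below is a minimal sketch (not the authors' code) of the two ingredients the abstract names: a majority-vote pseudo-reward over self-generated completions and a weighted log-likelihood objective that needs no reference model. The helper names (`majority_vote_weights`, `extract`-style answer strings) and the 0/1 weighting scheme are illustrative assumptions; the paper's exact objective and weighting may differ.

```python
# Sketch of a RoiRL-style offline step: weight each self-generated completion by
# whether its final answer matches the per-prompt majority vote, then maximize the
# weighted log-likelihood of those completions. Assumptions are noted inline.
from collections import Counter
import torch
import torch.nn.functional as F


def majority_vote_weights(answers):
    """Assumed reward: 1.0 if a completion's final answer matches the majority
    answer for the prompt, else 0.0 (a simple self-consistency signal)."""
    majority, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]


def weighted_log_likelihood_loss(logits, target_ids, weights, pad_id=-100):
    """Offline objective: negative log-likelihood of self-generated completions,
    each sequence scaled by its majority-vote weight (no reference model, no PPO).

    logits:     (batch, seq_len, vocab) from the model being trained
    target_ids: (batch, seq_len) completion tokens, pad_id where masked
    weights:    (batch,) per-sequence majority-vote weights
    """
    # Per-token log-probabilities of the sampled completion tokens.
    token_logp = -F.cross_entropy(
        logits.transpose(1, 2), target_ids, ignore_index=pad_id, reduction="none"
    )
    mask = (target_ids != pad_id).float()
    seq_logp = (token_logp * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
    return -(weights * seq_logp).mean()


if __name__ == "__main__":
    # Toy example: 4 sampled "completions" for one prompt; 3 agree on answer "42",
    # so those 3 get weight 1.0 and the dissenting one gets 0.0.
    answers = ["42", "42", "7", "42"]
    w = torch.tensor(majority_vote_weights(answers))
    fake_logits = torch.randn(4, 5, 100)          # stand-in for model outputs
    fake_targets = torch.randint(0, 100, (4, 5))  # stand-in for completion tokens
    print(weighted_log_likelihood_loss(fake_logits, fake_targets, w).item())
```

In an iterative setup, one would regenerate completions with the updated model and repeat; because training reduces to a weighted maximum-likelihood fit on fixed data, no reference model or on-policy rollouts are needed during optimization.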
Submission Number: 116