Teaching Models to Teach Themselves: Reasoning at the Edge of Learnability

Published: 05 Mar 2026, Last Modified: 05 Mar 2026
Venue: ICLR 2026 Workshop RSI (Poster)
License: CC BY 4.0
Keywords: self-improvement, self-play, reasoning, LLMs, reinforcement learning
TL;DR: We show that LLMs stuck on sparse-reward, difficult math problems can self-improve by self-generating a "stepping-stone" curriculum with grounded asymmetric meta-RL. This avoids relying on curated intermediate data or unstable intrinsic rewards.
Abstract: RL methods for finetuning large reasoning models stall on datasets with low initial success rates, and thus little training signal. We investigate a fundamental question: can a pretrained LLM leverage latent knowledge to generate an automated curriculum for problems it cannot solve? We explore this with SOAR, a self-improvement framework designed to surface these pedagogical signals through meta-RL. A teacher model proposes synthetic problems for a student model and is rewarded with the student's improvement on a subset of hard problems, grounding the curriculum in real student progress rather than proxy rewards. Our study on the hardest subsets of math benchmarks (0/128 success) reveals three core findings. First, it is possible to realize bi-level meta-RL that unlocks learning under sparse, binary rewards by sharpening a latent capacity of pretrained models to generate useful problems. Second, grounded rewards outperform the intrinsic rewards used in prior LLM self-play, reliably avoiding their typical instability and diversity-collapse failure modes. Third, the structure and well-posedness of questions are more critical for learning progress than solution correctness. Our results suggest that the ability to generate useful stepping stones does not require the ability to solve hard problems, paving a principled path to escape reasoning plateaus without additional curated data.
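The grounded teacher-student loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the `Student`, `Teacher`, difficulty levels, and all constants below are hypothetical stand-ins. The only ideas carried over from the abstract are that (a) the teacher proposes problems, (b) the student only learns from problems at the edge of its ability ("stepping stones"), and (c) the teacher's reward is the student's measured improvement on a held-out hard set rather than an intrinsic proxy.

```python
import random

random.seed(0)

LEVELS = list(range(6))      # hypothetical problem difficulty levels 0..5
HARD_SET = [3, 4, 5]         # held-out hard problems; unsolvable at skill 0
REPS_TO_ADVANCE = 2          # practice needed at the edge to gain a level

class Student:
    """Toy student: learns only from problems at its current skill level."""
    def __init__(self):
        self.skill, self.reps = 0, 0

    def solve_rate(self, problems):
        return sum(d <= self.skill for d in problems) / len(problems)

    def train(self, level):
        # Stepping-stone dynamic: too-easy or too-hard problems teach nothing.
        if level == self.skill:
            self.reps += 1
            if self.reps >= REPS_TO_ADVANCE:
                self.skill, self.reps = self.skill + 1, 0

class Teacher:
    """Toy teacher: epsilon-greedy bandit over difficulty levels."""
    def __init__(self, eps=0.3):
        self.q = {d: 0.0 for d in LEVELS}
        self.eps = eps

    def propose(self):
        if random.random() < self.eps:
            return random.choice(LEVELS)
        return max(LEVELS, key=lambda d: self.q[d])

    def update(self, level, reward):
        # Grounded reward: real improvement on the hard set, not a proxy.
        self.q[level] += 0.5 * (reward - self.q[level])

student, teacher = Student(), Teacher()
for _ in range(2000):
    level = teacher.propose()
    before = student.solve_rate(HARD_SET)
    student.train(level)
    teacher.update(level, student.solve_rate(HARD_SET) - before)

print(f"final hard-set solve rate: {student.solve_rate(HARD_SET):.2f}")
```

Note how the toy reproduces the sparse-reward setting: the teacher's reward stays zero until the student actually crosses a hard-set threshold, so progress is driven by proposing reachable intermediate difficulties rather than by the teacher ever solving the hard problems itself.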
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 34