Reinforcement Learning with World Model Feedback

Published: 05 May 2026 · Last Modified: 07 May 2026 · OpenReview Archive Direct Upload · CC BY 4.0
Abstract: We stand at the edge of a new challenge in AI: deploying language models, steered by reinforcement learning, as agents on tasks that are genuinely complex, multi-dimensional, and subjective. While standard reinforcement learning methods assume that a single scalar reward is sufficient to guide learning, we argue that this assumption is structurally false for the class of tasks that matters most: those where success requires balancing multiple competing dimensions of quality, and where no single ground truth exists against which to verify an output. Scalar feedback cannot tell a model which dimension of its attempt failed, let alone how much that dimension matters. This manuscript advances a position: effective training on complex tasks requires making the task's dimensional structure explicit within the training process itself, and sustaining that structure across episodes, not just within them. Sub-task decomposition and experiential learning are not independent improvements: the decomposition creates a stable sub-task vocabulary that makes experience indexable and transferable, while experiential learning populates that vocabulary with knowledge that compounds over time. Neither is fully valuable without the other. Decomposition without accumulated experience means every episode rediscovers the same sub-task strategies from scratch, while experience accumulated without a decomposition collapses into a flat memory that cannot transfer across tasks.
We introduce Reinforcement Learning with World Model Feedback (RLWM), a training paradigm built around five pillars: hierarchical decomposition of tasks into weighted sub-components; recurrent structured reflection over those components; relational constraints governing how sub-components interact; stabilization mechanisms that preserve the signal from rare successes; and a persistent graph-structured world model that accumulates experience indexed by the same sub-task vocabulary that structures feedback, enabling what each episode teaches about a competency to transfer across episodes, jobs, and agents. Together, these pillars offer a foundation for agents to learn from experience on generic tasks that lack verifiable rewards.
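To make the interplay between the first and last pillars concrete, a minimal sketch is given below: a weighted sub-task decomposition whose per-dimension scores are aggregated into a composite reward, and a world model that indexes accumulated lessons by the same sub-task names. All class names, sub-task labels, and numeric weights here are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch only: the names and weights below are assumptions,
# not the RLWM implementation described in the manuscript.
from dataclasses import dataclass, field


@dataclass
class SubTask:
    name: str      # stable vocabulary entry, e.g. "clarity"
    weight: float  # relative importance within the decomposition


@dataclass
class WorldModel:
    # Experience indexed by the same sub-task vocabulary that structures
    # feedback, so lessons transfer across episodes instead of sitting
    # in one flat, non-transferable memory.
    notes: dict = field(default_factory=dict)  # sub-task name -> lessons

    def record(self, subtask: str, lesson: str) -> None:
        self.notes.setdefault(subtask, []).append(lesson)

    def recall(self, subtask: str) -> list:
        return self.notes.get(subtask, [])


def composite_reward(scores: dict, subtasks: list) -> float:
    """Aggregate per-sub-task scores into one weighted scalar while the
    per-dimension breakdown stays available for structured reflection."""
    total_w = sum(t.weight for t in subtasks)
    return sum(t.weight * scores[t.name] for t in subtasks) / total_w


# Hypothetical decomposition of a writing task into three weighted dimensions.
subtasks = [SubTask("correctness", 0.5),
            SubTask("clarity", 0.3),
            SubTask("brevity", 0.2)]

wm = WorldModel()
wm.record("clarity", "define acronyms on first use")

reward = composite_reward(
    {"correctness": 0.9, "clarity": 0.6, "brevity": 0.8}, subtasks)
```

In this sketch the per-dimension scores reveal that "clarity" is the weak dimension of the attempt, which a scalar reward alone could not show, and the lesson recorded under "clarity" remains retrievable in later episodes that share the same sub-task vocabulary.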