Abstract: We stand at the edge of a new challenge in AI: deploying language models, steered by reinforcement
learning methods, as agents on tasks that are genuinely complex, multi-dimensional, and subjective.
While standard reinforcement learning methods assume that a single scalar reward is sufficient to guide
learning, we argue that this assumption is structurally false for the class of tasks that matters most – those
where success requires balancing multiple competing dimensions of quality and where no single ground
truth exists against which to verify an output. Scalar feedback cannot tell a model which dimension of
its attempt failed, let alone how much that dimension matters. This manuscript advances the position
that effective training on complex tasks requires making the task’s dimensional structure explicit within
the training process itself, and sustaining that structure across episodes, not just within them. Sub-task
decomposition and experiential learning are not independent improvements: the decomposition creates
a stable sub-task vocabulary that makes experience indexable and transferable; experiential learning
populates that vocabulary with knowledge that compounds over time. Neither is fully valuable without
the other – decomposition without accumulated experience means every episode rediscovers the same
sub-task strategies from scratch, while experience accumulated without a decomposition collapses into
a flat memory that cannot transfer across tasks. We introduce Reinforcement Learning with World
Model Feedback (RLWM), a training paradigm built around five pillars: hierarchical decomposition of
tasks into weighted sub-components, recurrent structured reflection over those components, relational
constraints governing how sub-components interact, stabilization mechanisms that preserve the signal
from rare successes, and a persistent graph-structured world model that accumulates experience indexed
by the same sub-task vocabulary that structures feedback – so that what each episode teaches about a
competency transfers across episodes, jobs, and agents. Together, these pillars offer a foundation for
agents to learn from experience on the broad class of tasks that lack verifiable rewards.
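
As a minimal illustration of the coupling the abstract describes (not part of the paper: the sub-task names, the weights, the 0.5 threshold, and the weighted-sum aggregation are all assumptions chosen for the sketch), the code below shows how per-sub-task scores could be collapsed into one scalar reward while a toy world model accumulates lessons indexed by the same sub-task vocabulary, making experience from one episode retrievable in the next:

```python
from collections import defaultdict

# Hypothetical sub-task decomposition for one job: name -> weight.
# Weights are assumed to sum to 1; the weighted sum is one simple
# aggregation choice, not necessarily the one RLWM uses.
SUBTASKS = {"coverage": 0.4, "coherence": 0.35, "style": 0.25}

def aggregate_reward(scores: dict[str, float]) -> float:
    """Collapse per-sub-task scores in [0, 1] into a single scalar reward."""
    return sum(SUBTASKS[name] * scores[name] for name in SUBTASKS)

class WorldModel:
    """Toy persistent memory: lessons keyed by sub-task name, so what an
    episode teaches about a competency can be retrieved later."""

    def __init__(self) -> None:
        self.lessons: dict[str, list[str]] = defaultdict(list)

    def record(self, scores: dict[str, float],
               lesson_by_subtask: dict[str, str]) -> None:
        # Store a lesson under the sub-tasks that scored poorly,
        # where accumulated experience matters most.
        for name, lesson in lesson_by_subtask.items():
            if scores[name] < 0.5:
                self.lessons[name].append(lesson)

    def retrieve(self, name: str) -> list[str]:
        return self.lessons[name]

wm = WorldModel()
episode_scores = {"coverage": 0.8, "coherence": 0.3, "style": 0.6}
wm.record(episode_scores, {"coherence": "state the claim before the evidence"})
print(aggregate_reward(episode_scores))  # 0.575
print(wm.retrieve("coherence"))          # lesson available to later episodes
```

Because the memory keys are the sub-task names themselves, a later episode, or a different agent using the same decomposition, can call wm.retrieve("coherence") and inherit the recorded lesson; this is the transfer property that flat, unstructured memory lacks.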