MORPHEUS: A Persistent Enterprise Benchmark for Continual RL in the Big World

Pavan Seshadri; Ankit Kumar; Anish Chhabra; Shashank Tonpekar; Sriram Ganapathi Subramanian; Kaheer Suleman; Sam Pasupalak; Sumit Pasupalak

Published: 10 Jun 2026, Last Modified: 10 Jun 2026

Venue: RL in Big Worlds (Poster)

License: CC BY 4.0

Keywords: Continual Reinforcement Learning, Big World Hypothesis, Benchmark

TL;DR: A new benchmark for Continual Reinforcement Learning in Big Worlds.

Abstract: We introduce Morpheus, a persistent enterprise simulation platform for continual reinforcement learning (CRL) research. Existing reinforcement learning (RL) benchmarks are episodic, low-dimensional, and stationary by design, properties antithetical to the challenges of real-world deployed systems. Grounded in the Big World Hypothesis, Morpheus provides environments in which the world never resets to an initial state, objectives shift over time, and past decisions have compounding consequences. Morpheus comprises four enterprise simulation environments; we evaluate two in this paper, drawn from outbound logistics and inbound warehouse operations, each exhibiting structured non-stationarity through a parameterisable failure injection engine and an asynchronous configuration shift controller. Policies are initialised via supervised fine-tuning on API-collected trajectories and subsequently trained with PPO-based reinforcement learning. Rewards are computed from operational verifiers embedded in the platform: structured failure event signals, financial ledger status, and resource throughput. We define a formal benchmark specification, propose a six-metric evaluation protocol covering per-configuration reward, adaptation speed, forgetting, recovery time, stability, and performance gap relative to a configuration-specific theoretical upper bound, and establish baseline results across four algorithm families: standard RL, replay-based, regularisation-based, and latent context modelling.

Submission Number: 16
