NitroBox: Lightning-Fast Sandbox for Large-Scale RL Training
Keywords: sandbox, agentic rl
TL;DR: NitroBox is a lightweight, fast sandbox designed for large-scale evaluation, RL training, and agent sandbox execution.
Abstract: Recent work on agentic training efficiency has mainly focused on improving learning algorithms, rollout strategies, or model-hardware co-design. However, these approaches overlook an important source of training latency: the sandbox that runs the environments. Recent studies show that creating, executing, and resetting containers account for a significant share of rollout time in reinforcement learning (RL) training, substantially slowing down overall training. This overhead largely stems from the centralized management mechanisms of existing sandboxes (e.g., a daemon), along with additional features (e.g., multi-tenancy) that are unnecessary for RL. Moreover, existing sandboxes are designed for general-purpose isolated execution, making them a poor fit for the requirements of RL rollout and training. We propose NitroBox, a sandbox system purpose-built for agentic RL that significantly accelerates rollout and training. First, we redesign the sandbox lifecycle to eliminate the dependency on a daemon, so that every sandbox instance is fully independent during creation, startup, and reset (e.g., via rootless full-namespace isolation), and we introduce a copy-on-write mechanism to enable fast startup across multiple instances. Second, we remove features that are time-consuming but unnecessary for RL, such as multi-tenancy, while preserving the core functionality and security guarantees needed for RL training. Finally, we introduce RL-specific features to further improve efficiency, such as optional CRIU-based process checkpointing and restoring for mid-trajectory branching. Our experiments show that NitroBox is faster than Docker across the entire lifecycle: 10.6× faster in sandbox creation, 24.6× in reset, and 19.1× in deletion. When training Qwen2.5-Coder-32B-Instruct with GRPO at a concurrency of c=32, replacing Docker with NitroBox reduces end-to-end wall-clock training time from 40.6 hours to 28.4 hours while preserving the same task pass rate.
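As a back-of-the-envelope reading of the numbers above, the short sketch below computes the end-to-end speedup implied by the reported training times. It also estimates what share of the Docker baseline would have to be sandbox lifecycle overhead for the reported lifecycle speedups to explain that saving. The second part assumes a simple Amdahl-style model in which only the sandbox portion of the run is accelerated, by one uniform factor. That model and the helper names (`speedup`, `implied_overhead_fraction`) are illustrative assumptions and do not come from the paper.

```python
def speedup(baseline_hours: float, new_hours: float) -> float:
    """End-to-end speedup: baseline time divided by new time."""
    return baseline_hours / new_hours


def implied_overhead_fraction(
    baseline_hours: float, new_hours: float, lifecycle_speedup: float
) -> float:
    """Fraction f of baseline time spent in the accelerated part.

    ASSUMPTION (not from the paper): Amdahl-style model
        new / baseline = (1 - f) + f / s
    where s is a single uniform lifecycle speedup. Solving for f gives
        f = (1 - new / baseline) / (1 - 1 / s).
    """
    saved = 1.0 - new_hours / baseline_hours
    return saved / (1.0 - 1.0 / lifecycle_speedup)


# Training times reported in the abstract (hours).
docker_hours = 40.6
nitrobox_hours = 28.4

print(f"end-to-end speedup: {speedup(docker_hours, nitrobox_hours):.2f}x")
print(f"time saved: {docker_hours - nitrobox_hours:.1f} h "
      f"({1.0 - nitrobox_hours / docker_hours:.1%} of baseline)")

# Lowest and highest lifecycle speedups reported in the abstract.
for s in (10.6, 24.6):
    f = implied_overhead_fraction(docker_hours, nitrobox_hours, s)
    print(f"lifecycle speedup {s}x -> implied overhead share {f:.1%}")
```

Under this simplified model, the reported times correspond to an end-to-end speedup of roughly 1.43× and would be consistent with sandbox overhead making up about a third of the Docker baseline. That is an estimate derived from the abstract's numbers, not a measured result.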
Track: Regular Paper (9 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 231