Sandbox-RL: Scalable Multi-LLMs Optimization through Sandbox-Based Reinforcement Learning

08 Sept 2025 (modified: 12 Nov 2025)
ICLR 2026 Conference Withdrawn Submission
License: CC BY 4.0
Keywords: Efficient Reinforcement Learning; Multi-Model Reinforcement Learning
Abstract: We introduce \textbf{Sandbox-RL}, a framework for scalable multi-LLMs optimization that enables heterogeneous language models to co-train efficiently within shared sandbox environments. Unlike traditional multi-agent systems that rely on inter-agent communication, Sandbox-RL orchestrates multiple LLMs with different architectures and specializations (Qwen2.5-7B, Llama 3.1-7B/8B, Llama 3.2-3B) as a learnable population within structured workflow graphs composed of modular \textit{sandbox environments} with strong isolation properties. Each sandbox provides computational isolation behind standardized interfaces, enabling precise reward attribution and reusable learning signals across diverse model architectures. The framework introduces temperature-regularized population-level optimization that adapts to heterogeneous model capabilities through competence matrices and cooperation temperature parameters. Our system features a KVCache-centric optimization architecture with distributed memory pools, intelligent prefill-decoding scheduling, and RDMA-based inter-node transfer protocols. Comprehensive evaluation across the Qwen and Llama model families demonstrates that Sandbox-RL achieves superior performance-efficiency trade-offs: Llama 3.1-8B attains the highest performance (0.978 score) with the fastest convergence (38 epochs) on the OASIS information-spread task, while Llama 3.2-3B offers the best efficiency (0.952 memory efficiency, 120.3 ms latency), validating the effectiveness of our scalable multi-LLMs optimization approach.
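The abstract does not give the optimization formulas, so the following is a minimal sketch of one plausible reading of the temperature-regularized population-level mechanism: a competence matrix over models and sandbox task types is mapped to per-task model weights via a temperature-scaled softmax, where the cooperation temperature interpolates between specialization (low tau) and uniform task sharing (high tau). All names here (`population_weights`, `competence`, `tau`) and the softmax form itself are assumptions for illustration, not the paper's published method.

```python
# Hypothetical sketch of temperature-regularized population-level weighting.
# The competence matrix C, cooperation temperature tau, and the softmax
# form below are assumptions; the paper does not publish exact formulas.
import numpy as np

def population_weights(competence: np.ndarray, tau: float) -> np.ndarray:
    """Map a (num_models x num_tasks) competence matrix to per-task
    model weights via a temperature-scaled softmax over models.

    Lower tau -> sharper, near winner-take-all task assignment;
    higher tau -> softer, more cooperative sharing of tasks.
    """
    logits = competence / tau                      # temperature scaling
    logits -= logits.max(axis=0, keepdims=True)    # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=0, keepdims=True)        # normalize over models

# Hypothetical usage: 3 models, 2 sandbox task types.
C = np.array([[0.9, 0.4],
              [0.6, 0.8],
              [0.3, 0.7]])
print(population_weights(C, tau=0.5))   # specialized: each task dominated by its best model
print(population_weights(C, tau=5.0))   # cooperative: weights closer to uniform
```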
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 3075