Keywords: Reinforcement learning, LLM agents, tool use, function calling, GRPO, environment design, transfer learning, rubric-based evaluation
Abstract: We present evidence that training AI agents on high-fidelity reinforcement learning environments produces capabilities that generalize beyond the training distribution. Using \corecraft{}, a realistic customer support simulation environment with expert-designed tasks and verifiable rubrics, we train GLM~4.6 using Group Relative Policy Optimization (GRPO) with adaptive clipping. After a single epoch of training, the model's pass rate on held-out evaluation tasks improves from 25.37\% to 36.76\% (+11.39 percentage points), surpassing Claude Opus 4.5 (33.49\%) and approaching GPT-5.1 High (36.86\%). Critically, these gains transfer to out-of-distribution benchmarks: +4.5\% on BFCL Parallel and +7.4\% on $\tau^2$-Bench Retail. We attribute this transfer to three design principles: task-centric world-building that optimizes for diverse, challenging tasks; expert-authored rubrics that enable reliable reward computation; and realistic enterprise workflows that mirror genuine professional patterns. Our results suggest that environment quality, diversity, and realism are key factors enabling generalizable agent capabilities, and that well-designed evaluation environments can serve as effective training substrates.
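For readers unfamiliar with the objective named in the abstract, the sketch below illustrates GRPO's core ingredients: group-relative advantages computed by standardizing rubric rewards within each sampled group, combined with a PPO-style clipped surrogate. This is a minimal PyTorch sketch under stated assumptions, not the authors' implementation; the function name `grpo_loss`, the use of per-response summed log-probabilities, and the fixed asymmetric bounds `(clip_low, clip_high)` standing in for the paper's adaptive clipping are all illustrative.

```python
import torch

def grpo_loss(logprobs_new: torch.Tensor,
              logprobs_old: torch.Tensor,
              rewards: torch.Tensor,
              clip_low: float = 0.2,
              clip_high: float = 0.28) -> torch.Tensor:
    """Clipped GRPO surrogate loss for one group of G sampled responses.

    logprobs_new / logprobs_old: (G,) summed token log-probs of each
    response under the current and behavior policies.
    rewards: (G,) scalar rubric rewards for the group.
    """
    # Group-relative advantages: standardize rewards within the group,
    # which removes the need for a learned value function (critic).
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Importance ratios between current and behavior policy.
    ratio = torch.exp(logprobs_new - logprobs_old)

    # PPO-style clipped surrogate. The asymmetric bounds here are a
    # placeholder for "adaptive clipping", whose schedule the abstract
    # does not specify.
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_low, 1.0 + clip_high) * adv
    return -torch.min(unclipped, clipped).mean()
```

Standardizing rewards within each sampled group is the main simplification GRPO makes relative to PPO: the group mean serves as the baseline, so no separate critic network needs to be trained.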
Submission Number: 91