Training Recipes for Agentic Reinforcement Learning in LLMs: A Survey

ACL ARR 2026 January Submission3280 Authors

04 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Agentic Reinforcement Learning, Large Language Models, Training Recipes, Survey, Contextual POMDP, Training Schemes, Training Infrastructure, Training Environments, Evaluation Benchmarks
Abstract: Standard alignment pipelines such as RLHF often fail to instill the robustness required for complex agentic tasks. This survey systematizes Agentic Reinforcement Learning, a paradigm that optimizes policies directly against environmental feedback within a Contextual POMDP framework. Unlike prior reviews that catalog agent capabilities, we focus on the engineering of the training recipes needed to build generalist agents from scratch. Our taxonomy examines schemes for rollout and optimization, infrastructure for high-throughput training, and environments that formalize interaction settings, alongside evaluation benchmarks that distinguish active training gyms from held-out certifications. By synthesizing these disparate components into a unified framework, we aim to accelerate the development of robust, next-generation autonomous agents.
Paper Type: Long
Research Area: AI/LLM Agents
Research Area Keywords: LLM/AI agents, reinforcement learning, fine-tuning, safety and alignment, benchmarking, chain-of-thought, optimization methods, task-oriented, embodied agents, reasoning, evaluation methodologies, retrieval-augmented generation
Contribution Types: Surveys
Languages Studied: English
Submission Number: 3280