Agentic RL Training
===================

Last updated: 07/15/2025.

Overview
----------
The goal of Agentic RL is to improve the performance of backend models from reinforcement learning to the Agent. During the training process, a series of features are developed:

1. Server-based asynchronous rollout
2. Multi-turn conversations and tool calls
3. LangGraph-based Agent


This document explains the system principles and usage involved to help users implement Agentic RL.


Server-based Asynchronous Rollout
---------------------------------

Since Agents need to interact with the environment through various tool calls, in order to avoid GPU idling while waiting for tool call return results, an asyncio based co-routing mechanism is utilized to execute each rollout requests asynchronously, thereby improving training performance. To support asynchronous rollout, the inference engine (server) and the agent (client) are architecturally separated, implementing a server-based system with the following objectives:

1. Enabling load balancing mechanisms to balance loads across multiple GPUs and reduce the impact of long-tail requests on performance. For this purpose, scheduling capabilities in stream mode (recipe\stream_mode) are implemented as a recipe.
2. Preventing agent specific features such as tracing from affecting the inference engine.

System Architecture
~~~~~~~~~~~~~~~~~~~

.. image:: XXXX

For more detail on internal design, please refer to :doc:`Agent Loop<../advance/agent_loop>`.

System Components
~~~~~~~~~~~~~~~~~

+--------------------------+----------------------------------------------------------------------------+
| Component                | Role                                                                       |
+==========================+============================================================================+
| AgentLoop                | Client, implements Agent functions                                         |
+--------------------------+----------------------------------------------------------------------------+
| AsyncLLMServerManager    | Inference gateway, provides generate interface for AgentLoop               |
+--------------------------+----------------------------------------------------------------------------+
| AsyncServer              | Server, each instance is connected to one DP group of the inference engine |
+--------------------------+----------------------------------------------------------------------------+

**"generate" Interface**

The "generate" function based on ray actor is used between the Client and Server instead of the standard chat completion API. This is because the conversion between tokens and text can be irreversible. For example, the token converted from "<think>" will be different from that generated by the LLM. During the training phase, it is necessary to strictly use the tokens generated by LLM inference to avoid inaccurate in computing advantage, which may affect model performance. Having the Server provide a token-based API helps the Client maintain the relationship between the text generated by tool calls and the tokens returned by the LLM, so as to output correct tokens for training.


**Inference Engine Adaptation**
AsyncServer uniformly provides a generate function to the upper layer, with separate implementations for SGLang and vLLM to hide underlying differences:

1. The SGLang AsyncServer uses the async_generate interface of the SGLang engine, which is located on the first GPU of each TP group. Therefore, AsyncServer needs to remotely call async_generate through ray actor.
2. The vLLM AsyncServer uses the generate interface of the vLLM engine, which can communicate with the GPUs in the TP group through ZMQ and can be directly called in AsyncServer.


Usage Example
~~~~~~~~~~~~~

Follow :doc:`GSM8K example<../examples/gsm8k_example>` to prepare the dataset and model checkpoints.

There are two options required to use agent loop:

- `data.return_raw_chat=True`
- `actor_rollout_ref.rollout.mode=async`

This example uses the sglang inference engine by default, and you can also modify rollout_name to use vllm.

.. code-block:: bash

    bash examples/grpo_trainer/run_qwen2-7b_seq_balance.sh


Multi-turn Conversations and Tool Calls
---------------------------------------

Follow :doc:`Multi-turn Rollout Support<../sglang_multiturn/multiturn>` to prepare tool and configuration files.

The Tool Agent Loop has an additional requirement: adding an "agent_name" field to the dataset. During rollout, it will choose to use tool_agent_loop or single_turn_agent (default) based on this field.

Usage Example
~~~~~~~~~~~~~

.. code-block:: bash

    # install mlflow to view toolcall and llm trace
    pip install mlflow

    # This will download and preprocess the GSM8K dataset into ~/data/gsm8k/ and add the "agent_name" field.
    python examples/data_preprocess/gsm8k_tool_agent_loop.py

    # Start training with tool calls and enabled mlflow based trace helping to debug the rollout details
    bash examples/sglang_multiturn/run_qwen2.5-3b_gsm8k_tool_agent_mlflow.sh

    # When training is done, start a mlflow server to view trace
    mlflow ui -h 0.0.0.0 -p 5000 --backend-store-uri sqlite:////tmp/mlruns.db

    # then you can open http://<your ip address>:5000 from browser to view trace


Note: During training, because the model may sometimes fail to generate correct toolcall tags, an error message "Failed to decode tool call" will be output to the console, which does not indicate an abnormality in training.


Follow :doc:`Rollout trace<../advance/rollout_trace>` to known more about trace feature.



Agent Framework
---------------

System Architecture
~~~~~~~~~~~~~~~~~~~

.. image:: XXXX

System Components
~~~~~~~~~~~~~~~~~

+--------------------------+-----------------------------------------------------------------------------------------------+
| Component                | Role                                                                                          |
+==========================+===============================================================================================+
| ChatModel                | LLM object of LangChain, used to adapt to the “generate” api provided by AsyncLLMServerManager|
+--------------------------+-----------------------------------------------------------------------------------------------+
| RectAgentLoop            | Agent adaptation layer, which by default supports a naive LangGraph Agentic.                  |
|                          | New classes can be derived to support user-defined Agents, and the run function needs to be   |
|                          | implemented to complete Agent calls.                                                          |
+--------------------------+-----------------------------------------------------------------------------------------------+
| AsyncServer              | Server, each instance is connected to one DP group of the inference engine.                   |
+--------------------------+-----------------------------------------------------------------------------------------------+


Follow doc "recipe/langgraph_agent/example/README.md" for more details.