\section{Training setting}
\label{train}
For reinforcement learning, we train Qwen2.5-VL-32B with TRICO~\cite{VAGEN}, a PPO-based~\cite{schulman2017proximal} algorithm for more efficient multi-turn reinforcement learning of multimodal large language models (MLLMs).
Specifically, we trained for 10.2 hours on 16 H100 GPUs with the following hyperparameters: turn-level discount $\gamma_{\text{turn}}=0.95$, token-level discount $\gamma_{\text{token}}=1.0$, KL penalty coefficient $0.001$, actor learning rate $1 \times 10^{-6}$, and critic learning rate $1 \times 10^{-5}$.
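
To illustrate the role of the two discount factors, consider a simplified sketch that assumes a plain discounted return over turns. This is an illustrative assumption, not the exact TRICO estimator, and the symbols $G_k$, $r_j$, and $K$ are introduced here only for this example. For a trajectory of $K$ turns with per-turn reward $r_j$, the return credited to turn $k$ would be
\[
G_k = \sum_{j=k}^{K} \gamma_{\text{turn}}^{\,j-k}\, r_j .
\]
With $\gamma_{\text{turn}}=0.95$, a reward received three turns later is weighted by $0.95^{3} \approx 0.857$, whereas $\gamma_{\text{token}}=1.0$ means that, under the same assumption, credit is not discounted across the tokens within a single turn.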






