Keywords: LLM Agents, Training-Free Learning, Group Relative Policy Optimization
Abstract: Recent advances in Large Language Model (LLM) agents have demonstrated promising general capabilities. However, their performance in specialized real-world domains often degrades because of the difficulty of effectively integrating external tools and domain-specific prompting strategies. Methods such as agentic reinforcement learning have been proposed to address this, but they typically rely on costly parameter updates, e.g., Supervised Fine-Tuning (SFT) or Group Relative Policy Optimization (GRPO), to alter the output distribution. We argue instead that LLMs can achieve a similar shift in the output distribution by introducing a token prior, a far more lightweight approach that both addresses practical data scarcity and avoids the common issue of overfitting. To this end, we propose Training-free Group Relative Policy Optimization (Training-free GRPO), a cost-effective solution that enhances LLM agent performance without any parameter updates. Our method leverages minimal ground-truth data to perform multiple rollouts, applying a group-based relative scoring mechanism in each epoch to iteratively distill high-quality experiential knowledge. This knowledge serves as the learned token prior and is seamlessly integrated into LLM API calls to guide model behavior. Experiments on mathematical reasoning and web search tasks demonstrate that Training-free GRPO, applied to DeepSeek-V3.1, significantly improves out-of-domain performance.
With just a few dozen training samples, Training-free GRPO outperforms fine-tuned small LLMs and achieves competitive results. Our code is available at https://anonymous.4open.science/r/Training-Free-GRPO/.
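As a rough illustration of the learning loop described in the abstract, the Python sketch below shows how group-relative scoring over multiple rollouts might be used to distill an experiential token prior without any parameter updates. The helper names (call_llm, evaluate, summarize_lessons) and the prompting details are hypothetical placeholders, not the paper's actual implementation.

# Minimal sketch (illustrative only) of a training-free, GRPO-style learning loop.
# The helpers call_llm, evaluate, and summarize_lessons stand in for a black-box
# LLM API, a reward function, and an LLM-based distillation step; they are
# assumptions, not the paper's actual interfaces.

from typing import Callable, Dict, List

def training_free_grpo(
    call_llm: Callable[[str], str],                 # frozen LLM API, e.g. DeepSeek-V3.1
    evaluate: Callable[[str, str], float],          # reward of a rollout vs. ground truth
    summarize_lessons: Callable[[List[str]], str],  # distill lessons into a compact prior
    train_set: List[Dict[str, str]],                # a few dozen {"question", "answer"} pairs
    epochs: int = 3,
    group_size: int = 4,
) -> str:
    """Iteratively distill an experiential token prior without any parameter updates."""
    experience = ""  # the learned token prior, carried across epochs
    for _ in range(epochs):
        lessons: List[str] = []
        for sample in train_set:
            prompt = f"{experience}\n\nQuestion: {sample['question']}"
            # Draw a group of rollouts from the frozen model for the same query.
            rollouts = [call_llm(prompt) for _ in range(group_size)]
            rewards = [evaluate(r, sample["answer"]) for r in rollouts]
            # Group-relative scoring: compare each rollout against the group mean.
            mean_reward = sum(rewards) / len(rewards)
            for rollout, reward in zip(rollouts, rewards):
                if reward > mean_reward:
                    lessons.append("Behavior that helped:\n" + rollout)
                elif reward < mean_reward:
                    lessons.append("Behavior to avoid:\n" + rollout)
        # Distill the contrastive lessons collected this epoch into the token prior.
        experience = summarize_lessons(lessons)
    return experience  # prepend to future API calls to steer the agent

In use, the returned experience string would simply be prepended to the prompt of subsequent API calls, shifting the output distribution without touching model weights.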
Primary Area: reinforcement learning
Submission Number: 19858