Keywords: LLM Agents, Training-Free Learning, Group Relative Policy Optimization
Abstract: Reinforcement Learning (RL) has emerged as a pivotal strategy for adapting Large Language Model (LLM) agents to specialized domains and complex tool-use scenarios. However, existing approaches typically instantiate the policy as a parameterized LLM, relying on gradient-based updates such as Group Relative Policy Optimization (GRPO). This paradigm incurs prohibitive computational costs and risks catastrophic forgetting, often making it impractical for resource-constrained scenarios. In this work, we propose a fundamental rethinking of agentic RL by introducing Training-Free Group Relative Policy Optimization (Training-Free GRPO). It instantiates the policy as a frozen LLM paired with a variable experiential context, shifting optimization from the parameter space to the context space. Mirroring the iterative structure of vanilla GRPO, our method replaces gradient descent with multi-epoch learning that introspects on groups of trial-and-error rollouts, where the LLM extracts a semantic group advantage to iteratively refine its problem-solving experiences without parameter updates. Experiments on mathematical reasoning and web search tasks demonstrate that Training-Free GRPO establishes a new Pareto frontier between test-time performance and learning cost. We further show that applying our method to a frozen flagship LLM such as DeepSeek-V3.1-Terminus with merely 100 training samples outperforms fully fine-tuning a 32B LLM, while slashing learning costs by two orders of magnitude, from $800 to $8. Training-Free GRPO thus offers a highly effective and accessible pathway for optimizing LLM behaviors in real-world applications.
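The loop described in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the `llm` and `reward` callables, the prompt wording, and the best-versus-worst comparison used to extract a lesson are all assumptions. The point it demonstrates is that the object being optimized is a list of natural-language experiences, while the model's parameters stay frozen.

```python
from typing import Callable, List

# Hypothetical interfaces: `llm(prompt) -> str` is any frozen chat model,
# and `reward(task, rollout) -> float` is a task-specific scorer
# (e.g. exact match on the final answer). Neither is from the paper's code.
LLM = Callable[[str], str]
Reward = Callable[[str, str], float]


def training_free_grpo(
    llm: LLM,
    reward: Reward,
    tasks: List[str],
    epochs: int = 3,
    group_size: int = 4,
) -> List[str]:
    """Sketch of Training-Free GRPO: optimize an experiential context
    instead of model parameters. The frozen LLM both solves tasks and
    distills lessons from groups of its own rollouts."""
    experiences: List[str] = []  # the "policy" lives here, not in weights

    for _ in range(epochs):
        for task in tasks:
            context = "\n".join(f"- {e}" for e in experiences)

            # 1) Sample a group of rollouts conditioned on current experiences.
            rollouts = [
                llm(f"Known experiences:\n{context}\n\nSolve the task:\n{task}")
                for _ in range(group_size)
            ]
            rewards = [reward(task, r) for r in rollouts]

            # 2) Skip groups with no learning signal (all equally good or bad),
            #    analogous to a zero group advantage giving no gradient in GRPO.
            if max(rewards) == min(rewards):
                continue

            # 3) "Semantic group advantage" (simplified here to a best-vs-worst
            #    comparison): the frozen LLM states a transferable lesson.
            best = rollouts[rewards.index(max(rewards))]
            worst = rollouts[rewards.index(min(rewards))]
            lesson = llm(
                "Compare a successful and a failed attempt at the same task.\n"
                f"Task:\n{task}\n\nSuccessful attempt:\n{best}\n\n"
                f"Failed attempt:\n{worst}\n\n"
                "State one concise, reusable lesson explaining what the "
                "successful attempt did right."
            )

            # 4) Update the experiential context; no parameter update occurs.
            experiences.append(lesson.strip())

    return experiences
```

At test time, the learned `experiences` list would simply be prepended to the frozen LLM's prompt, which is what makes the method training-free.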
Primary Area: reinforcement learning
Submission Number: 19858