Keywords: LLM, Agents
Abstract: While Large Language Model (LLM) agents excel at general tasks, they inherently struggle with continual adaptation because their weights are frozen after deployment. Conventional reinforcement learning (RL) offers a solution but incurs prohibitive computational costs and risks catastrophic forgetting.
We introduce Just-In-Time Reinforcement Learning (JitRL), a training-free framework that enables test-time policy optimization without any gradient updates.
JitRL maintains a dynamic, non-parametric memory of experiences and retrieves relevant trajectories to estimate action advantages on the fly.
These estimates are then used to directly modulate the LLM's output logits.
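To illustrate the mechanism, here is a minimal sketch of advantage-based logit modulation, not the paper's actual code: it assumes per-action advantage estimates have already been computed from retrieved trajectories, and the function name `modulate_logits` and temperature `beta` are our placeholders.

```python
import numpy as np

def modulate_logits(logits, advantages, beta=1.0):
    """Additively shift action logits by retrieved advantage estimates.

    logits:     (num_actions,) base LLM logits over candidate actions
    advantages: (num_actions,) advantage estimates from retrieved trajectories
    beta:       KL-penalty strength; smaller beta trusts the memory more
    """
    shifted = np.asarray(logits) + np.asarray(advantages) / beta
    # Softmax over the shifted logits; the normalizer absorbs log Z(s).
    probs = np.exp(shifted - shifted.max())
    return probs / probs.sum()

# Hypothetical usage: three candidate actions, memory favors action 1.
print(modulate_logits([0.2, 0.1, -0.3], [0.0, 1.5, -0.5], beta=1.0))
```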
We theoretically prove that this additive update rule is the exact closed-form solution to the KL-constrained policy optimization objective.
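For context, the standard closed-form result for KL-regularized policy optimization that this claim invokes can be stated as follows; the symbols (reference policy $\pi_{\mathrm{ref}}$, advantage $A$, temperature $\beta$, logits $z$) are our notation, not necessarily the paper's:

$$
\pi^{*}(a \mid s) \;=\; \arg\max_{\pi}\; \mathbb{E}_{a \sim \pi}\big[A(s,a)\big] - \beta\, D_{\mathrm{KL}}\big(\pi(\cdot \mid s)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid s)\big) \;\propto\; \pi_{\mathrm{ref}}(a \mid s)\,\exp\!\big(A(s,a)/\beta\big),
$$

which in logit space is the additive update $z'(s,a) = z(s,a) + A(s,a)/\beta$, with the normalizer $\log Z(s)$ absorbed by the softmax.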
Extensive experiments on WebArena and Jericho demonstrate that JitRL sets a new state of the art among training-free methods.
Crucially, JitRL outperforms computationally expensive fine-tuning methods (e.g., WebRL) while reducing monetary costs by over 30×, offering a scalable path toward continually learning agents. The code is available at https://anonymous.4open.science/r/JitRL-D485.
Submission Number: 9