Keywords: test-time adaptation, continual adaptation, online learning, LLM agents, LLM memory
Abstract: Large language model (LLM) agents deployed in real systems often face distribution shift relative to their training data. This motivates test-time adaptation: improving an agent's behavior after deployment using verifiable feedback (e.g., binary correctness signals or unit tests) rather than ground-truth training data. The design space for improving test-time performance is broad, spanning scaled inference compute (e.g., longer reasoning) and adaptation methods that update context, external memory, or model parameters. We present a unified empirical characterization of test-time adaptation for LLM agents with verifiable feedback, measuring adaptation compute in wall-clock time. Under a fixed data budget, we compare in-context adaptation (e.g., updating context or external memory) with fine-tuning via reinforcement learning, tracking compute throughout. We find that adaptation helps most when tasks benefit from improved reasoning over the model's existing knowledge and skills, but offers limited gains when tasks demand learning new facts from the adaptation data. We summarize results as an accuracy versus adaptation-compute Pareto frontier that makes the efficiency trade-offs across methods explicit, helping practitioners choose the adaptation method best suited to their deployment scenario.
Submission Number: 219