Keywords: Test-Time Learning, Self-Evolving Agents, Curriculum Learning
Abstract: Test-time learning enables large language model (LLM) agents to adapt during inference without costly retraining, yet prior work largely treats all test-time experience as equally useful. We ask a simple question: *what data should agents learn from at test time?* Focusing on task selection and ordering for context-based adaptation, we hypothesize that redundant or overly simple examples offer diminishing returns, while curated curricula improve sample efficiency. Using the Agentic Context Engineering (ACE) framework, we evaluate on the AppWorld benchmark of tool-use and coding agents. We show that careful data selection can match full-dataset performance using only $\sim$30\% of training tasks, and that task ordering measurably affects learning outcomes. Our results position curriculum curation as a first-class design dimension for efficient test-time agent learning and practical deployment.
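For intuition, the curation recipe the abstract describes (pick a compact subset of tasks, then order them) can be sketched in a few lines. This is a minimal illustration only: the `Task` fields, the novelty/difficulty scores, and the easy-to-hard ordering rule are assumptions for exposition, not the paper's actual selection criterion.

```python
from dataclasses import dataclass

@dataclass
class Task:
    task_id: str
    difficulty: float  # hypothetical score in [0, 1]
    novelty: float     # hypothetical redundancy-penalized score in [0, 1]

def curate_curriculum(tasks: list[Task], budget_frac: float = 0.3) -> list[Task]:
    """Select ~budget_frac of tasks and order them easy-to-hard.

    Sketch only: scoring and ordering rules are illustrative assumptions.
    """
    # Drop redundant or overly simple tasks: keep the most novel ones.
    ranked = sorted(tasks, key=lambda t: t.novelty, reverse=True)
    budget = max(1, int(len(tasks) * budget_frac))
    selected = ranked[:budget]
    # Order the curated subset from easy to hard (one plausible curriculum).
    return sorted(selected, key=lambda t: t.difficulty)

if __name__ == "__main__":
    demo = [Task(f"t{i}", difficulty=i / 10, novelty=(i * 7 % 10) / 10)
            for i in range(10)]
    for task in curate_curriculum(demo):
        print(task.task_id, task.difficulty, task.novelty)
```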
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 58