Keywords: LLM Agents, human-agent interaction, long-horizon
Abstract: Large Language Model (LLM) agents have recently shown strong potential in domains such as automated coding, deep research, and graphical user interface manipulation.
However, training them to succeed on long-horizon, domain-specialized tasks remains challenging. Current methods fall primarily into two categories. The first relies on dense human annotations for behavior cloning, which is prohibitively expensive for long-horizon tasks that can take days or even months. The second depends on outcome-driven sampling, which often collapses because valid positive trajectories are rare on domain-specialized tasks.
We introduce Apollo, a sampling framework that integrates asynchronous human guidance with action-level data filtering. Instead of requiring annotators to shadow every step, Apollo lets them intervene only when the agent drifts from a promising trajectory, for example by supplying prior knowledge or strategic advice. This lightweight design makes it possible to sustain interactions for over 30 hours and produces valuable trajectories at far lower cost than dense step-by-step annotation.
To demonstrate the effectiveness of Apollo, we evaluate it on InnovatorBench. Our experiments show that when Apollo is used to train the GLM-4.5 model on InnovatorBench, it achieves more than a 50% improvement over the untrained baseline and a 28% improvement over a variant trained without human interaction. These results highlight the critical role of human-in-the-loop sampling and the robustness of Apollo's design in handling long-horizon, domain-specialized tasks.
Paper Type: Long
Research Area: AI/LLM Agents
Research Area Keywords: LLM/AI agents, code models
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data analysis
Languages Studied: English
Submission Number: 3685