Keywords: LLM Agents, human-agent interaction, long-horizon
Abstract: Large Language Model (LLM) agents have recently shown strong potential in domains such as automated coding, deep research, and graphical user interface manipulation.
However, training them to succeed on long-horizon, domain-specialized tasks remains challenging. Current methods fall primarily into two categories. The first relies on dense human annotations for behavior cloning, which is prohibitively expensive for long-horizon tasks that can take days or even months. The second depends on outcome-driven sampling, which often collapses because valid positive trajectories are rare on domain-specialized tasks.
We introduce Apollo, a sampling framework that integrates asynchronous human guidance with action-level data filtering. Instead of requiring annotators to shadow every step, Apollo lets them intervene only when the agent drifts from a promising trajectory, for example by supplying prior knowledge or strategic advice. This lightweight design makes it possible to sustain interactions for over 30 hours and produces valuable trajectories at far lower cost than dense step-by-step annotation.
To demonstrate the effectiveness of Apollo, we evaluate it on InnovatorBench. Our experiments show that when Apollo is used to train the GLM-4.5 model on InnovatorBench, it achieves more than a 50% improvement over the untrained baseline and a 28% improvement over a variant trained without human interaction. These results highlight the critical role of human-in-the-loop sampling and the robustness of Apollo's design in handling long-horizon, domain-specialized tasks.
Paper Type: Long
Research Area: AI/LLM Agents
Research Area Keywords: LLM/AI agents, code models
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data analysis
Languages Studied: English
Submission Number: 3685