Double-Helix Co-Training for Computer-Use Generator and Verifier Models
Keywords: Reward Model, Computer Use Agent, Adversarial Training
Abstract: Reinforcement learning for computer-use agents requires accurate reward signals, but such signals are often hard to obtain at scale. Large language model (LLM) judges provide a convenient source of feedback, yet they can be biased toward fluent model-generated behavior and yield false positives. We propose a generator-verifier co-training framework, which we refer to as Double-Helix Co-Training. The verifier is trained using matched-state preferences that compare trusted demonstration actions, optionally including checker-filtered transitions, against the generator's counterfactual alternatives at the same history state. The generator is optimized using verifier-induced rewards, and step scores are aggregated into a conservative trajectory reward via a min or soft-min operator. We further interpret the alternating updates through a generalized EM view, which motivates trust-region regularization and clarifies the stabilizing dynamics. Empirically, our verifier achieves state-of-the-art precision on AgentRewardBench, and the co-trained 7B generator is a strong byproduct that is competitive on ScreenSpot-V2, ScreenSpot-Pro, and Mind2Web.
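The conservative aggregation mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact soft-min formula, so the log-mean-exp form below, the temperature parameter `tau`, and the function names `min_reward` and `soft_min_reward` are assumptions made for illustration, not the authors' implementation. Under this form, a small `tau` approaches the hard minimum over step scores, and a large `tau` approaches their mean.

```python
import math


def min_reward(step_scores):
    """Hard-min aggregation: the trajectory is only as good as its worst step."""
    if not step_scores:
        raise ValueError("step_scores must be non-empty")
    return min(step_scores)


def soft_min_reward(step_scores, tau=0.1):
    """Soft-min aggregation (assumed form): -tau * log(mean(exp(-s / tau))).

    `tau` is a hypothetical temperature. The result always lies between
    the hard minimum and the arithmetic mean of the step scores.
    """
    if not step_scores:
        raise ValueError("step_scores must be non-empty")
    if tau <= 0:
        raise ValueError("tau must be positive")
    # Shift by the minimum so every exponent is <= 0 (numerical stability).
    m = min(step_scores)
    total = sum(math.exp(-(s - m) / tau) for s in step_scores)
    return m - tau * math.log(total / len(step_scores))


# Hypothetical per-step verifier scores for one trajectory.
scores = [0.9, 0.8, 0.2, 0.95]
hard = min_reward(scores)
soft_sharp = soft_min_reward(scores, tau=0.01)   # close to the hard min
soft_smooth = soft_min_reward(scores, tau=100.0)  # close to the mean
```

Either operator lets a single low-scoring step pull the trajectory reward down, which is one way to read "conservative" here: a fluent trajectory with one bad action does not receive a high reward.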
Track: Regular Paper (9 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 72