UI-S1: Advancing GUI Automation via Semi-online Reinforcement Learning

Zhengxi Lu; Jiabo Ye; Fei Tang; Yongliang Shen; Haiyang Xu; Ziwei Zheng; Weiming Lu; Ming Yan; Fei Huang; Jun Xiao; Yueting Zhuang

15 Sept 2025 (modified: 13 Nov 2025) | ICLR 2026 Conference Withdrawn Submission | CC BY 4.0

Keywords: GUI Agent, Reinforcement learning, Large Language Model

TL;DR: We present Semi-online Reinforcement Learning, a novel paradigm that simulates online RL on offline trajectories.

Abstract: Graphical User Interface (GUI) agents have demonstrated remarkable progress in automating complex user interface interactions through reinforcement learning (RL). However, current approaches face a fundamental dilemma: offline RL enables stable training on pre-collected trajectories but struggles with multi-step task execution because it lacks trajectory-level reward signals, whereas online RL captures these signals through environment interaction but suffers from sparse rewards and prohibitive deployment costs. To address this dilemma, we present $\textbf{Semi-online Reinforcement Learning}$, a novel paradigm that simulates online RL on offline trajectories. During each rollout, we preserve the model's original output within the multi-turn dialogue, while a Patch Module adaptively recovers from divergence between the rollout and expert trajectories. To capture long-term training signals, Semi-online RL introduces discounted future returns into the reward computation and optimizes the policy with weighted step-level and episode-level advantages. We further introduce Semi-Online Performance ($\textbf{SOP}$), a metric that aligns better with true online performance and serves as a practical and effective proxy for real-world evaluation. Experiments show that our $\textbf{UI-S1-7B}$ achieves SOTA performance among 7B models across four dynamic benchmarks, with significant gains over the base model (e.g., +12.0\% on AndroidWorld, +23.8\% on AITW), demonstrating substantial progress in bridging the gap between offline training efficiency and online multi-turn reasoning.
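
The abstract mentions discounted future returns and a weighted combination of step-level and episode-level advantages but gives no formulas. The Python sketch below is one possible reading, not the authors' implementation. The names `discounted_returns`, `normalize`, and `combined_advantages`, the discount `gamma`, the weight `omega`, and the choice to normalize within a group of rollouts are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def discounted_returns(rewards, gamma=0.5):
    """Return G_t = r_t + gamma * G_{t+1} for each step of one rollout."""
    returns = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def normalize(values, eps=1e-8):
    """Center and scale values within a group (assumed normalization)."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def combined_advantages(group_rewards, gamma=0.5, omega=1.0):
    """Combine episode-level and step-level advantages for a group of rollouts.

    group_rewards: list of non-empty per-step reward lists, one per rollout.
    omega: hypothetical weight on the step-level term.
    """
    returns = [discounted_returns(r, gamma) for r in group_rewards]
    # Episode level: compare whole rollouts by their return from the first step.
    episode_adv = normalize([g[0] for g in returns])
    combined = [[0.0] * len(g) for g in returns]
    for t in range(max(len(g) for g in returns)):
        # Step level: compare only rollouts that actually reached step t.
        alive = [i for i, g in enumerate(returns) if len(g) > t]
        step_adv = normalize([returns[i][t] for i in alive])
        for i, a in zip(alive, step_adv):
            combined[i][t] = episode_adv[i] + omega * a
    return combined


if __name__ == "__main__":
    print(discounted_returns([1.0, 0.0, 1.0], gamma=0.5))
    print(combined_advantages([[1.0, 1.0], [0.0, 0.0]], gamma=0.5))
```

In this sketch a reward earned at a later step raises the return of every earlier step, which is how a per-step reward can carry a longer-horizon signal. Rollouts of different lengths are handled by normalizing each step only over the rollouts that reached it.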

Supplementary Material: zip

Primary Area: applications to computer vision, audio, language, and other modalities

Submission Number: 5417
