SwipeGen: Bridging the Execution Gap in GUI Agents via Human-like Swipe Synthesis

ACL ARR 2026 January Submission 10609 Authors

06 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: GUI Agents, Data Generation, Vision-Language Models, Human-like Gestures, Benchmark
Abstract: With the widespread adoption of Graphical User Interface (GUI) agents for automating GUI interaction tasks, substantial research has focused on improving GUI perception to ground task instructions into concrete action steps. However, the step execution capability of these agents has gradually emerged as a new bottleneck for task completion. In particular, existing GUI agents often adopt overly simplified strategies for handling \textit{swipe} interactions, preventing them from accurately replicating human behavior. To address this limitation, we decompose human swipe gestures into multiple quantifiable dimensions and propose \texttt{SwipeGen}, an automated pipeline that synthesizes human-like swipe interactions through GUI exploration. Based on this pipeline, we construct and release the first benchmark for evaluating the swipe execution capability of GUI agents. Furthermore, leveraging the synthesized data, we propose \texttt{GUISwiper}, a GUI agent with enhanced interaction execution capabilities. Experimental results demonstrate that \texttt{GUISwiper} achieves a swipe execution accuracy of 69.07\%, a 214\% improvement over existing VLM baselines. Our code, dataset, and model are available at \url{https://anonymous.4open.science/r/UI-anoy-91BC/}.
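The abstract does not enumerate the quantifiable dimensions into which swipe gestures are decomposed, but such gestures are commonly parameterized by start/end coordinates, duration, a velocity profile, and trajectory noise. The minimal Python sketch below, using hypothetical names (`synthesize_swipe`, `duration_ms`, `jitter_px`) that do not come from the paper, illustrates how dimensions of this kind could be combined into a human-like touch trajectory; it is an assumption-laden illustration, not the actual SwipeGen pipeline.

```python
import math
import random

def synthesize_swipe(start, end, duration_ms=300, n_points=20, jitter_px=2.0):
    """Generate a human-like swipe as a list of (x, y, t_ms) touch samples.

    Hypothetical dimensions (assumed, not from the paper): start/end points,
    duration, an ease-in-out velocity profile, and small perpendicular jitter
    mimicking hand tremor.
    """
    (x0, y0), (x1, y1) = start, end
    dx, dy = x1 - x0, y1 - y0
    length = math.hypot(dx, dy) or 1.0
    # Unit vector perpendicular to the swipe direction, used for jitter.
    px, py = -dy / length, dx / length
    samples = []
    for i in range(n_points):
        u = i / (n_points - 1)
        # Smoothstep easing: slow start, fast middle, slow finish,
        # approximating a human velocity profile.
        s = u * u * (3 - 2 * u)
        # No jitter on the first/last sample so endpoints stay exact.
        j = random.gauss(0.0, jitter_px) if 0 < i < n_points - 1 else 0.0
        samples.append((x0 + dx * s + px * j,
                        y0 + dy * s + py * j,
                        round(u * duration_ms)))
    return samples

# Example: a 300 ms upward scroll gesture on a 1080x2400 screen.
trajectory = synthesize_swipe((540, 1800), (540, 600))
```

Each sample is a timestamped touch point, so a straight-line, constant-velocity swipe (the "overly simplified strategy" the abstract criticizes) is just the degenerate case with linear easing and zero jitter.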
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: AI / LLM Agents, Generation, Language Modeling
Contribution Types: Publicly available software and/or pre-trained models, Data resources
Languages Studied: English, Chinese
Submission Number: 10609