SwipeGen: Bridging the Execution Gap in GUI Agents via Human-like Swipe Synthesis

ACL ARR 2026 January Submission 10609 Authors

06 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: GUI Agents, Data Generation, Vision-Language Models, Human-like Gestures, Benchmark
Abstract: With the widespread adoption of Graphical User Interface (GUI) agents for automating GUI interaction tasks, substantial research has focused on improving GUI perception to ground task instructions into concrete action steps. However, the step execution capability of these agents has gradually emerged as a new bottleneck for task completion. In particular, existing GUI agents often adopt overly simplified strategies for handling \textit{swipe} interactions, preventing them from accurately replicating human behavior. To address this limitation, we decompose human swipe gestures into multiple quantifiable dimensions and propose \texttt{SwipeGen}, an automated pipeline that synthesizes human-like swipe interactions through GUI exploration. Based on this pipeline, we construct and release the first benchmark for evaluating the swipe execution capability of GUI agents. Furthermore, leveraging the synthesized data, we propose \texttt{GUISwiper}, a GUI agent with enhanced interaction execution capabilities. Experimental results demonstrate that \texttt{GUISwiper} achieves a swipe execution accuracy of 69.07\%, a 214\% improvement over existing VLM baselines. Our code, dataset, and model are available at \url{https://anonymous.4open.science/r/UI-anoy-91BC/}.
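The abstract does not enumerate the quantifiable dimensions into which swipe gestures are decomposed, but such gestures are commonly parameterized by start/end coordinates, duration, a velocity profile, and trajectory noise. The minimal Python sketch below, using hypothetical names (`synthesize_swipe`, `duration_ms`, `jitter_px`) that do not come from the paper, illustrates how dimensions of this kind could be combined into a human-like touch trajectory; it is an assumption-laden illustration, not the actual SwipeGen pipeline.

```python
import math
import random

def synthesize_swipe(start, end, duration_ms=300, n_points=20, jitter_px=2.0):
    """Generate a human-like swipe as a list of (x, y, t_ms) touch samples.

    Hypothetical dimensions (assumed, not from the paper): start/end points,
    duration, an ease-in-out velocity profile, and small perpendicular jitter
    mimicking hand tremor.
    """
    (x0, y0), (x1, y1) = start, end
    dx, dy = x1 - x0, y1 - y0
    length = math.hypot(dx, dy) or 1.0
    # Unit vector perpendicular to the swipe direction, used for jitter.
    px, py = -dy / length, dx / length
    samples = []
    for i in range(n_points):
        u = i / (n_points - 1)
        # Smoothstep easing: slow start, fast middle, slow finish,
        # approximating a human velocity profile.
        s = u * u * (3 - 2 * u)
        # No jitter on the first/last sample so endpoints stay exact.
        j = random.gauss(0.0, jitter_px) if 0 < i < n_points - 1 else 0.0
        samples.append((x0 + dx * s + px * j,
                        y0 + dy * s + py * j,
                        round(u * duration_ms)))
    return samples

# Example: a 300 ms upward scroll gesture on a 1080x2400 screen.
trajectory = synthesize_swipe((540, 1800), (540, 600))
```

Each sample is a timestamped touch point, so a straight-line, constant-velocity swipe (the "overly simplified strategy" the abstract criticizes) is just the degenerate case with linear easing and zero jitter.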
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: AI / LLM Agents, Generation, Language Modeling
Contribution Types: Publicly available software and/or pre-trained models, Data resources
Languages Studied: English, Chinese
Submission Number: 10609