Abstract: Recent advances in Vision Language Models (VLMs) have demonstrated remarkable capabilities in understanding static screenshots. However, a key requirement for building a robust GUI automation system is understanding dynamic GUI actions, i.e., videos depicting fundamental GUI operations, which enables agents to learn from human demonstrations. This is a non-trivial task distinct from natural-scene video captioning: (i) GUI screenshots pack denser information than natural scenes because of their high resolution; (ii) actions in GUI videos unfold more quickly, demanding precise detection of their time spans; (iii) many frames in GUI videos carry little information, adding unnecessary computational cost to captioning. To address these challenges, we propose Act2Cap, a new video captioning benchmark designed specifically for GUI action videos, comprising 10,866 diverse video-caption pairs that provide not only the temporal locations of keyframes but also detailed narration of action type, element, location, and purpose. In addition, we propose GUI Narrator, a framework that uses cursor detection to enhance action interpretation in high-resolution screenshots. Our framework improves performance both for open-source models and as a plug-and-play component for closed-source models, while reducing computational cost. The datasets and models are available at https://github.com/showlab/GUI-Narrator.
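To make the cursor-detection idea concrete, below is a minimal sketch of cursor-centred cropping: locate the cursor in a full-resolution screenshot and crop a fixed-size patch around it, so the VLM attends only to the action region. The use of OpenCV template matching, the cursor_template.png image, and the 512-pixel patch size are illustrative assumptions for this sketch, not the detector described in the paper.

```python
import cv2
import numpy as np

def crop_around_cursor(frame_path: str, cursor_template_path: str,
                       patch_size: int = 512) -> np.ndarray:
    """Return a patch of the screenshot centred on the detected cursor."""
    frame = cv2.imread(frame_path)               # full-resolution screenshot
    template = cv2.imread(cursor_template_path)  # small cursor template image
    # Slide the template over the frame; the best match marks the cursor.
    scores = cv2.matchTemplate(frame, template, cv2.TM_CCOEFF_NORMED)
    _, _, _, top_left = cv2.minMaxLoc(scores)    # top-left corner of best match
    cx = top_left[0] + template.shape[1] // 2    # cursor centre, x
    cy = top_left[1] + template.shape[0] // 2    # cursor centre, y
    # Clamp the patch so it stays inside the frame boundaries.
    h, w = frame.shape[:2]
    x0 = min(max(cx - patch_size // 2, 0), max(w - patch_size, 0))
    y0 = min(max(cy - patch_size // 2, 0), max(h - patch_size, 0))
    return frame[y0:y0 + patch_size, x0:x0 + patch_size]
```

Cropping around the cursor in this way keeps the region where GUI actions occur at native resolution while discarding most of the screen, which is the intuition behind the reduced computational cost claimed above.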
DOI: 10.1145/3746027.3755150