Track: Short Paper Track (up to 3 pages)
Keywords: video understanding, vision and language, large language model, vision language model, LLM, VLM, operating system, mobile, Mobile OS, procedure understanding, video summarization
TL;DR: We present MOTIFY, a novel method using Vision-Language Models to predict scene transitions and actions in mobile OS task videos. It requires no manual annotation, outperforms baselines, and aims to enable scalable mobile agent development.
Abstract: We present MOTIFY, a novel approach for predicting scene transitions and actions from mobile operating system (OS) task videos. By leveraging pretrained Vision-Language Models (VLMs), MOTIFY extracts task sequences from real-world YouTube videos without manual annotation. Our method addresses the limitations of existing approaches, which rely on manual data annotation or simulation environments. We demonstrate MOTIFY's effectiveness on a diverse set of mobile OS tasks across multiple platforms, outperforming baseline methods in scene transition detection and action prediction. This approach opens new possibilities for scalable, real-world mobile agent development and video understanding research.
Supplementary Material: zip
Submission Number: 48