Mobile OS Task Procedure Extraction from YouTube

Published: 28 Oct 2024, Last Modified: 14 Jan 2025
Venue: Video-Language Models Poster
License: CC BY 4.0
Track: Short Paper Track (up to 3 pages)
Keywords: video understanding, vision and language, large language model, vision language model, LLM, VLM, operating system, mobile, Mobile OS, procedure understanding, video summarization
TL;DR: We present MOTIFY, a novel method using Vision-Language Models to predict scene transitions and actions in mobile OS task videos. It requires no manual annotation, outperforms baselines, and aims to enable scalable mobile agent development.
Abstract: We present MOTIFY, a novel approach for predicting scene transitions and actions from mobile operating system (OS) task videos. By leveraging pretrained Vision-Language Models (VLMs), MOTIFY extracts task sequences from real-world YouTube videos without manual annotation. Our method addresses the limitations of existing approaches, which rely on manual data annotation or simulation environments. We demonstrate MOTIFY's effectiveness on a diverse set of mobile OS tasks across multiple platforms, outperforming baseline methods in scene transition detection and action prediction. This approach opens new possibilities for scalable, real-world mobile agent development and video understanding research.
Supplementary Material: zip
Submission Number: 48