Abstract: Recent advances in mobile GUI automation have leveraged multimodal large language models (MLLMs) to automate user tasks. However, deploying these models on mobile devices poses significant challenges, including high computational cost, suboptimal performance, and limited adaptability to mobile-specific contexts. In this paper, we propose LLaVA-Mob, a lightweight multimodal agent designed for efficient smartphone GUI automation. LLaVA-Mob couples a compact 1B-parameter language model with a GUI-optimized vision encoder, both tailored for mobile environments. Additionally, we introduce a synthetic data generation approach that produces high-quality, domain-aligned datasets, improving alignment between the visual and textual modalities. Experiments on the AITW dataset demonstrate that LLaVA-Mob achieves performance comparable to larger models while significantly reducing computational cost, making it well suited for resource-constrained mobile platforms. We will release our code, model, and datasets upon publication.
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: Mobile agent, Efficiency, Synthetic data
Contribution Types: Approaches to low-resource settings, Approaches to low compute settings-efficiency, Publicly available software and/or pre-trained models, Data resources
Languages Studied: English
Submission Number: 1223