LLaVA-Mob: Efficient Large Language and Vision Assistant for Mobile

ACL ARR 2024 December Submission 1223 Authors

16 Dec 2024 (modified: 05 Feb 2025), ACL ARR 2024 December Submission, CC BY 4.0
Abstract: Recent advancements in mobile GUI automation have leveraged multimodal large language models (MLLMs) for task automation. However, deploying these models on mobile devices poses significant challenges, including high computational costs, suboptimal performance, and limited adaptability to mobile-specific contexts. In this paper, we propose LLaVA-Mob, a lightweight multimodal agent designed for efficient smartphone GUI automation. LLaVA-Mob features a compact 1B-parameter language model and a GUI-optimized vision encoder, specifically tailored for mobile environments. Additionally, we introduce a synthetic data generation approach to produce high-quality, domain-aligned datasets, enhancing alignment between visual and textual modalities. Experiments on the AITW dataset demonstrate that LLaVA-Mob achieves performance comparable to larger models while significantly reducing computational costs, making it well-suited for resource-constrained mobile platforms. We will release our code, model, and datasets upon publication.
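The abstract describes a LLaVA-style architecture: a GUI-optimized vision encoder whose outputs are projected into the embedding space of a compact 1B-parameter language model. The sketch below illustrates that general pattern only; it is not the authors' implementation, and every module size, class name, and layer choice here is a placeholder assumption.

```python
import torch
import torch.nn as nn

class LLaVAStyleAgent(nn.Module):
    """Minimal LLaVA-style pipeline: vision features -> projector -> small LM.
    All component sizes and module choices are illustrative, not the paper's."""

    def __init__(self, vision_dim=768, lm_dim=1024, vocab_size=32000, lm_layers=4):
        super().__init__()
        # Stand-in GUI vision encoder; the paper uses a GUI-optimized encoder,
        # details of which are not given in the abstract.
        self.vision_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=vision_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Projector aligning visual tokens with the LM embedding space.
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, lm_dim), nn.GELU(), nn.Linear(lm_dim, lm_dim)
        )
        # Stand-in compact language model (a real system would use a
        # causal decoder-only 1B-parameter LM).
        self.text_embed = nn.Embedding(vocab_size, lm_dim)
        self.lm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=lm_dim, nhead=8, batch_first=True),
            num_layers=lm_layers,
        )
        self.lm_head = nn.Linear(lm_dim, vocab_size)

    def forward(self, patch_features, instruction_ids):
        # patch_features: (B, num_patches, vision_dim) from a screenshot.
        visual_tokens = self.projector(self.vision_encoder(patch_features))
        text_tokens = self.text_embed(instruction_ids)
        # Prepend visual tokens to the instruction, then predict action tokens.
        fused = torch.cat([visual_tokens, text_tokens], dim=1)
        return self.lm_head(self.lm(fused))


if __name__ == "__main__":
    model = LLaVAStyleAgent()
    img = torch.randn(1, 64, 768)            # 64 screenshot patch embeddings
    ids = torch.randint(0, 32000, (1, 16))   # tokenized instruction
    print(model(img, ids).shape)             # torch.Size([1, 80, 32000])
```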
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: Mobile agent, Efficient, Synthetic data
Contribution Types: Approaches to low-resource settings, Approaches to low-compute settings/efficiency, Publicly available software and/or pre-trained models, Data resources
Languages Studied: English
Submission Number: 1223
