Abstract: Recent advances in mobile GUI automation have leveraged multimodal large language models (MLLMs) to automate user tasks. However, deploying these models on mobile devices poses significant challenges, including high computational cost, suboptimal performance, and limited adaptability to mobile-specific contexts. In this paper, we propose LLaVA-Mob, a lightweight multimodal agent designed for efficient smartphone GUI automation. LLaVA-Mob couples a compact 1B-parameter language model with a GUI-optimized vision encoder, both tailored for mobile environments. Additionally, we introduce a synthetic data generation approach that produces high-quality, domain-aligned datasets, improving alignment between the visual and textual modalities. Experiments on the AITW dataset demonstrate that LLaVA-Mob achieves performance comparable to larger models while significantly reducing computational cost, making it well suited for resource-constrained mobile platforms. We will release our code, model, and datasets upon publication.
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: Mobile agent, Efficiency, Synthetic data
Contribution Types: Approaches to low-resource settings, Approaches to low compute settings-efficiency, Publicly available software and/or pre-trained models, Data resources
Languages Studied: English
Submission Number: 1223