Keywords: Hard Example Mining; Vision-Language Model; GUI Agent; GUI Grounding
Abstract: The core capability of a Graphical User Interface (GUI) agent built on a Multimodal Large Language Model (MLLM) is accurate GUI grounding: precisely locating actionable elements in screenshots according to instructions. Traditional fine-tuning faces two core challenges: low data efficiency and poor grounding of small objects. Supervised Fine-Tuning (SFT), the mainstream approach, requires massive datasets. While rule-based Reinforcement Fine-Tuning (RFT) offers improvements, it still fails to separate useful data from overwhelming redundancy: most samples are easy to learn and contribute little to performance. Inspired by the human learning mechanism of "problem-type-specific retraining," we construct a decoupled visual concept library to provide high-value retraining resources and, based on this library, propose IconBank, a hard-sample mining framework. Our key finding is that a small number of carefully selected difficult samples can match or even exceed the performance of training on massive data. Specifically, we first extract operable elements from multiple open-source GUI datasets to build a unified decoupled visual concept library (IconBank), where an "icon" is redefined as a pure visual atomic concept stripped of context, background, and layout. We then search this library for visually similar elements and select targeted practice samples to form a minimal, refined training set. Experimental results show that a 3B model trained on only 2K samples achieves 51.7% on the ScreenSpot-Pro benchmark, surpassing most 7B models. This result supports the assumption of massive redundancy in GUI data and shows that data quality (diversity and difficulty) matters far more than quantity.
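The abstract describes retrieving visually similar elements from the concept library to assemble a small, targeted retraining set. The following is a minimal sketch of that selection step under assumptions not stated in the abstract: elements are represented by precomputed visual embeddings, similarity is cosine similarity, and the top-k neighbors of each failure case are pooled and deduplicated. The function names, embedding dimension, and selection rule are illustrative, not the authors' implementation.

```python
# Hypothetical sketch: similarity-based hard-sample retrieval over a library
# of cropped GUI elements. Embedding model, data layout, and the top-k
# selection rule are assumptions for illustration only.
import numpy as np


def cosine_sim(query: np.ndarray, bank: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and every vector in the bank."""
    q = query / (np.linalg.norm(query) + 1e-8)
    b = bank / (np.linalg.norm(bank, axis=1, keepdims=True) + 1e-8)
    return b @ q


def mine_hard_samples(failed_embs: np.ndarray,
                      bank_embs: np.ndarray,
                      k: int = 5) -> list[int]:
    """For each element the model failed to ground, retrieve the k most
    visually similar library elements as targeted retraining candidates,
    then deduplicate across all failures."""
    selected: set[int] = set()
    for q in failed_embs:
        sims = cosine_sim(q, bank_embs)
        selected.update(np.argsort(-sims)[:k].tolist())
    return sorted(selected)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    bank = rng.normal(size=(10_000, 512))   # embeddings of library elements (assumed dim 512)
    failures = rng.normal(size=(20, 512))   # embeddings of mis-grounded elements
    print(len(mine_hard_samples(failures, bank, k=5)))  # size of the refined training set
```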
Supplementary Material: zip
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 11295