OS-Kairos: Adaptive Interaction for MLLM-Powered GUI Agents

ACL ARR 2025 February Submission1094 Authors

12 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · License: CC BY 4.0
Abstract: Autonomous graphical user interface (GUI) agents powered by multimodal large language models have shown great promise. However, a critical yet underexplored issue persists: over-execution, where the agent carries out tasks fully autonomously without adequately assessing its action confidence, which compromises adaptive human-agent collaboration. This poses substantial risks in complex scenarios, such as those involving ambiguous user instructions, unexpected interruptions, and environmental hijacks. To address this issue, we introduce OS-Kairos, an adaptive GUI agent that predicts a confidence level at each interaction step and efficiently decides whether to act autonomously or seek human intervention. OS-Kairos is developed through two key mechanisms: (i) collaborative probing, which annotates confidence scores at each interaction step; and (ii) confidence-driven interaction, which leverages these confidence scores to enable adaptive interaction. Experimental results show that OS-Kairos substantially outperforms existing models on our curated dataset of complex scenarios, as well as on established benchmarks such as AITZ and Meta-GUI, with 24.59%–87.29% improvements in task success rate. OS-Kairos facilitates adaptive human-agent collaboration, prioritizing effectiveness, generality, scalability, and efficiency for real-world GUI interaction. The dataset and code are available at Anonymous.
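To make the confidence-driven interaction mechanism concrete, below is a minimal Python sketch of the step-level decision loop the abstract describes: the agent proposes an action with a confidence score, executes it autonomously when the score is high, and requests human intervention otherwise. This is not the authors' implementation; names such as propose_action, ask_human, and CONF_THRESHOLD are illustrative assumptions.

    # Minimal sketch of a confidence-gated GUI interaction loop (assumed API, not OS-Kairos code).
    from dataclasses import dataclass
    from typing import Callable

    CONF_THRESHOLD = 0.7  # assumed cut-off; the paper predicts step-level confidence scores

    @dataclass
    class Step:
        action: str        # e.g. "tap(login_button)"
        confidence: float  # predicted confidence in [0, 1] for this step

    def run_task(
        instruction: str,
        propose_action: Callable[[str, str], Step],  # (instruction, screen) -> proposed Step
        execute: Callable[[str], str],               # action -> next screen state
        ask_human: Callable[[str, Step], str],       # (screen, proposed step) -> corrected action
        screen: str,
        max_steps: int = 20,
    ) -> None:
        """Act autonomously on high-confidence steps; defer to a human on low-confidence ones."""
        for _ in range(max_steps):
            step = propose_action(instruction, screen)
            if step.action == "STOP":
                break
            if step.confidence >= CONF_THRESHOLD:
                screen = execute(step.action)        # autonomous execution
            else:
                corrected = ask_human(screen, step)  # human-in-the-loop correction
                screen = execute(corrected)

The key design choice illustrated here is that the act-or-ask decision is made per step rather than per task, which is what allows the agent to stay autonomous on routine steps while escalating only the ambiguous or risky ones.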
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: GUI agent, MLLM agent, Confidence evaluation, Adaptive interaction, Human-in-the-loop
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 1094