Abstract: Graphical user interface (GUI) agents powered by multimodal large language models (MLLMs) have demonstrated impressive capabilities in understanding and interacting with operating system environments. However, despite their strong task performance, these models often exhibit hallucinations—systematic errors in action prediction that compromise reliability.
In this study, we conduct a comprehensive analysis of the hallucinatory behaviors exhibited by GUI agent models in an icon localization task. We introduce a novel evaluation framework that moves beyond traditional accuracy-based metrics by categorizing model predictions into four distinct types: correct predictions, biased hallucinations, misleading hallucinations, and confusing hallucinations. This fine-grained classification provides deeper insights into model failure modes.
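The abstract does not spell out the decision rules behind the four categories, so the following is only a minimal illustrative sketch of such a taxonomy for icon localization. The category names come from the abstract; the geometric rules, the `near_margin` threshold, and all function names are hypothetical assumptions, not the paper's definitions.

```python
from enum import Enum, auto

class PredictionType(Enum):
    CORRECT = auto()
    BIASED_HALLUCINATION = auto()
    MISLEADING_HALLUCINATION = auto()
    CONFUSING_HALLUCINATION = auto()

def inside(point, box):
    """Return True if the (x, y) point lies within box = (x0, y0, x1, y1)."""
    x, y = point
    x0, y0, x1, y1 = box
    return x0 <= x <= x1 and y0 <= y <= y1

def classify_prediction(pred_point, target_box, distractor_boxes, near_margin=20):
    """Assign a predicted click to one of four illustrative categories.

    Hypothetical decision rules (assumptions, not the paper's criteria):
    - CORRECT: the click lands inside the target icon's bounding box.
    - MISLEADING: the click lands inside some other icon's bounding box.
    - BIASED: the click misses the target but falls within `near_margin`
      pixels of its box, i.e. a systematic offset from the right icon.
    - CONFUSING: anything else (neither the target nor any other icon).
    """
    if inside(pred_point, target_box):
        return PredictionType.CORRECT
    if any(inside(pred_point, box) for box in distractor_boxes):
        return PredictionType.MISLEADING_HALLUCINATION
    x0, y0, x1, y1 = target_box
    expanded = (x0 - near_margin, y0 - near_margin, x1 + near_margin, y1 + near_margin)
    if inside(pred_point, expanded):
        return PredictionType.BIASED_HALLUCINATION
    return PredictionType.CONFUSING_HALLUCINATION
```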
Furthermore, we investigate the distribution of output logits corresponding to different response types and reveal key deviations from the behavior observed in traditional classification tasks. To support this analysis, we propose a new metric derived from the structural characteristics of the logits distribution, offering a fresh perspective on model confidence and uncertainty in GUI interaction tasks.
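The abstract does not define the proposed logits-based metric, so the sketch below only shows how standard structural statistics of an output-token logit vector (normalized entropy and the top-1/top-2 probability margin) could be computed and compared across the response categories above. These statistics are stand-ins chosen for illustration, not the paper's metric.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D logit vector."""
    z = logits - np.max(logits)
    p = np.exp(z)
    return p / p.sum()

def logit_structure_stats(logits):
    """Summarize the shape of a single output-token logit vector.

    Returns normalized entropy (0 = fully peaked, 1 = uniform) and the gap
    between the top-1 and top-2 probabilities; both are generic confidence
    proxies, used here only as placeholders for the metric in the abstract.
    """
    p = softmax(np.asarray(logits, dtype=np.float64))
    entropy = -np.sum(p * np.log(p + 1e-12))
    norm_entropy = entropy / np.log(len(p))
    top2 = np.sort(p)[-2:]
    margin = top2[1] - top2[0]
    return {"normalized_entropy": float(norm_entropy),
            "top1_top2_margin": float(margin)}
```

One plausible use, under the same assumptions, is to average these statistics separately over predictions in each of the four categories and inspect how the distributions differ from those of a conventional classifier.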
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: vision question answering; cross-modal application; multimodal applications
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 3181