Learning to Act Anywhere: Experience-Based Similarity for Universal Interface Agents

ACL ARR 2026 January Submission 10174 Authors

06 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: UI Grounding, Vision-Language Alignment, Training-Free Methods, Inference-Time Fusion, Cross-Platform Generalization
Abstract: Large multimodal model (LMM)-based interface agents often fail to generalize under interface perturbations or cross-operating-system (OS) shifts because they rely on environment-specific mappings and brittle grounding mechanisms. We present Universal-VLA, a training-free framework for UI grounding that adapts past interaction experiences at inference time. Universal-VLA mitigates a practical limitation of dual-branch architectures by performing vision-language alignment within a shared Contrastive Language-Image Pretraining (CLIP) latent space, while separately leveraging Optical Character Recognition (OCR)-based text similarity and combining the two modalities via a simple max-fusion strategy. We further introduce Elastic Visual Memory, a lightweight retrieval module that provides experience-based priors without additional training. Universal-VLA achieves near-ceiling robustness on the diagnostic Evo-UI++ benchmark (98.4% on icon-only tasks) and a competitive 32.0% end-to-end task success on the real-world ScreenSpot-v2 benchmark, generalizing across Android, iOS, and Web platforms and outperforming existing training-free baselines while maintaining an 83 ms per-step latency. Overall, Universal-VLA offers an efficient and privacy-preserving alternative to computation-heavy UI agents.
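
Below is a minimal, hypothetical sketch of the max-fusion scoring described in the abstract: cosine similarity between the instruction and each candidate UI-element crop in CLIP space, a separate OCR-based text similarity, and element selection by the larger of the two scores. The model checkpoint, the OCR text source, the choice of SequenceMatcher for text similarity, and the lack of score normalization are assumptions for illustration, not the authors' exact configuration.

```python
# Hypothetical sketch of CLIP + OCR max fusion for UI-element grounding.
# Assumptions: a pretrained CLIP checkpoint, element crops and their OCR text
# are already extracted; scores are fused without re-scaling.
from difflib import SequenceMatcher

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def clip_similarity(instruction: str, crops: list[Image.Image]) -> torch.Tensor:
    """Cosine similarity between the instruction and each element crop in CLIP space."""
    inputs = processor(text=[instruction], images=crops, return_tensors="pt", padding=True)
    with torch.no_grad():
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return (img @ txt.T).squeeze(-1)  # one visual score per crop


def ocr_similarity(instruction: str, ocr_texts: list[str]) -> torch.Tensor:
    """String similarity between the instruction and each element's OCR text."""
    scores = [SequenceMatcher(None, instruction.lower(), t.lower()).ratio()
              for t in ocr_texts]
    return torch.tensor(scores)


def select_element(instruction: str, crops: list[Image.Image],
                   ocr_texts: list[str]) -> int:
    """Max fusion: each candidate keeps the larger of its visual and textual score."""
    fused = torch.maximum(clip_similarity(instruction, crops),
                          ocr_similarity(instruction, ocr_texts))
    return int(fused.argmax())
```

Under this reading, max fusion lets whichever modality is more confident dominate: icon-only elements with no OCR text fall back on the CLIP score, while text-heavy widgets can be resolved by the OCR branch alone.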
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: AI / LLM Agents, Multimodality and Language Grounding to Vision, Robotics and Beyond
Languages Studied: English
Submission Number: 10174