GUI-KV: Efficient GUI Agents via KV Cache with Spatio-Temporal Awareness

19 Sept 2025 (modified: 11 Feb 2026) | Submitted to ICLR 2026 | Readers: Everyone | License: CC BY 4.0
Keywords: Vision Language Models; KV Cache Compression; GUI Agent
Abstract: Graphical user interface (GUI) agents face severe efficiency bottlenecks when processing long sequences of high-resolution screenshots, making inference costly and memory-bound. Existing KV cache compression methods, designed for natural images, remain suboptimal as they fail to exploit the unique spatial and temporal redundancies of GUIs. In this work, we first demonstrate that unlike natural images, GUI attention sparsity is uniformly high (>0.99) across all transformer layers, invalidating complex layer-varying budget strategies. Building on this insight, we introduce GUI-KV, a training-free compression method that allocates a uniform budget driven by two novel mechanisms: (1) spatial saliency guidance, which augments attention with residual stream L2 norms to preserve semantic visual tokens; and (2) temporal redundancy scoring, which employs subspace projection to identify and prune historical frames that are linearly redundant with the current view. Across six benchmarks, GUI-KV outperforms competitive baselines, often recovering near-full-cache accuracy at 10-20% budgets. Notably, on AgentNetBench, it reduces decoding FLOPs by 38.9% while increasing step accuracy by 4.1% over the full-cache baseline.
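Illustrative sketch (not the authors' implementation): the abstract names two scoring mechanisms and a uniform per-layer budget but gives no formulas, so the NumPy code below shows only one plausible reading. All function names, the mixing weight `alpha`, the subspace `rank`, and the way the two signals would be combined are assumptions made for illustration. `spatial_scores` adds a normalized residual-stream L2 norm to the attention each cached token receives, `temporal_residual` measures how far a previous frame's keys lie from the subspace spanned by the current frame's keys (a small residual suggests linear redundancy), and `keep_indices` keeps the same fraction of tokens regardless of layer.

```python
import numpy as np

def spatial_scores(attn, hidden, alpha=1.0):
    """Attention received per token plus a normalized L2 norm of its hidden state.

    attn:   (n_tokens,) attention mass each cached token receives.
    hidden: (n_tokens, d) residual-stream states for the same tokens.
    alpha:  hypothetical mixing weight (an assumption, not from the paper).
    """
    norms = np.linalg.norm(hidden, axis=1)
    norms = norms / (norms.max() + 1e-12)  # scale to [0, 1]
    return attn + alpha * norms

def temporal_residual(prev_keys, cur_keys, rank=8):
    """Distance of each previous-frame key from the current frame's key subspace.

    The subspace is spanned by the top-`rank` right singular vectors of
    cur_keys. A small residual means the old key is (nearly) a linear
    combination of current-frame directions, i.e. a candidate for pruning.
    """
    _, _, vt = np.linalg.svd(cur_keys, full_matrices=False)
    basis = vt[:rank]                       # (r, d), orthonormal rows
    projected = prev_keys @ basis.T @ basis
    return np.linalg.norm(prev_keys - projected, axis=1)

def keep_indices(scores, budget):
    """Indices of the top `budget` fraction of tokens, in original order."""
    k = max(1, int(round(budget * len(scores))))
    return np.sort(np.argsort(scores)[-k:])
```

Under these assumptions, a layer with a 20% budget would call `keep_indices(scores, 0.2)` with the same `budget` value at every layer, in line with the uniform allocation the abstract describes.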
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 20434