Abstract: Graphical user interface (GUI) agents face severe efficiency bottlenecks when processing long sequences of high-resolution screenshots, making inference costly and memory-bound. Existing KV cache compression methods, designed for natural images, remain suboptimal as they fail to exploit the unique spatial and temporal redundancies of GUIs. In this work, we first demonstrate that unlike natural images, GUI attention sparsity is uniformly high (>0.99) across all transformer layers, invalidating complex layer-varying budget strategies. Building on this insight, we introduce GUI-KV, a training-free compression method that allocates a uniform budget driven by two novel mechanisms: (1) spatial saliency guidance, which augments attention with residual stream L2 norms to preserve semantic visual tokens; and (2) temporal redundancy scoring, which employs subspace projection to identify and prune historical frames that are linearly redundant with the current view. Across six benchmarks, GUI-KV outperforms competitive baselines, often recovering near-full-cache accuracy at 10-20% budgets. Notably, on AgentNetBench, it reduces decoding FLOPs by 38.9% while increasing step accuracy by 4.1% over the full-cache baseline.
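The abstract sketches the two scoring mechanisms only at a high level. The following is a minimal, illustrative sketch of how they could be realized; the function names, the `alpha` mixing weight, and the use of QR for the subspace basis are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def spatial_saliency(attn_scores, hidden_states, alpha=0.5):
    """Augment per-token attention scores with residual-stream L2 norms.

    attn_scores: (n,) aggregated attention received by each visual token.
    hidden_states: (n, d) residual-stream activations for those tokens.
    alpha is a hypothetical mixing weight; the paper's exact combination
    rule is not specified in the abstract.
    """
    norms = np.linalg.norm(hidden_states, axis=-1)
    return attn_scores + alpha * (norms / norms.max())

def temporal_redundancy(prev_tokens, curr_tokens):
    """Score how linearly redundant each historical-frame token is
    with the current view, via projection onto the subspace spanned
    by the current frame's token features.

    prev_tokens: (m, d) tokens from a historical screenshot.
    curr_tokens: (n, d) tokens from the current screenshot.
    Returns per-token projection residual norms; a low residual means
    the token is nearly a linear combination of current-view tokens
    and is a candidate for pruning.
    """
    # Orthonormal basis for the row space of the current frame.
    Q, _ = np.linalg.qr(curr_tokens.T)          # (d, r)
    projected = prev_tokens @ Q @ Q.T           # projection onto subspace
    return np.linalg.norm(prev_tokens - projected, axis=-1)
```

Under this sketch, a fixed uniform budget would be filled by ranking tokens with `spatial_saliency` and dropping historical tokens whose `temporal_redundancy` residual falls below a threshold.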
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Boqing_Gong1
Submission Number: 7262