SSD: Spatial-Semantic Head Decoupling for Efficient Autoregressive Image Generation

03 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Autoregressive Image Generation, KV Cache Compression
TL;DR: We design an efficient KV cache compression method for autoregressive image generation.
Abstract: Autoregressive image generation models like Janus-Pro produce high-quality images, but at the cost of high memory and computational demands due to the large number of visual tokens. While KV cache compression has been extensively studied in language modeling, it remains largely unexplored for image generation. In this work, we begin by identifying a distinct attention phenomenon, which we term spatial locality and emergent semantic sink. To leverage this, we introduce a novel KV cache compression framework. Specifically, we compress the KV cache for visual tokens by decoupling attention heads into two types: for spatial-locality heads, our method maintains a short recent token window; for semantic-sink heads, it preserves a compact set of highly-attended tokens. Experiments demonstrate that our method achieves a 5$\times$ reduction in memory usage and a 6.6$\times$ speedup in throughput with negligible performance loss, enabling efficient native autoregressive image generation.
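The abstract's eviction rule can be sketched in a few lines: each attention head keeps either a short window of recent visual tokens (spatial-locality heads) or a small set of highly-attended "sink" tokens (semantic-sink heads). The sketch below is a minimal illustration under assumed shapes and names (`compress_kv_cache`, the `head_types` labels, and the cumulative `attn_scores` input are all illustrative, not the authors' implementation).

```python
import numpy as np

def compress_kv_cache(keys, values, attn_scores, head_types,
                      window=4, num_sinks=2):
    """Illustrative per-head KV cache compression (not the paper's code).

    keys, values: (num_heads, seq_len, head_dim) cached K/V tensors
    attn_scores:  (num_heads, seq_len) cumulative attention each cached
                  token has received, used to rank semantic sinks
    head_types:   one label per head, "spatial" or "semantic"
    Returns a list (one entry per head) of kept token indices.
    """
    num_heads, seq_len, _ = keys.shape
    kept = []
    for h in range(num_heads):
        if head_types[h] == "spatial":
            # Spatial-locality head: keep only a short recent window.
            idx = np.arange(max(0, seq_len - window), seq_len)
        else:
            # Semantic-sink head: keep the most-attended tokens,
            # returned in their original sequence order.
            idx = np.sort(np.argsort(attn_scores[h])[-num_sinks:])
        kept.append(idx)
    return kept
```

A spatial head with `seq_len=8` and `window=4` would retain indices `[4, 5, 6, 7]`, while a semantic head retains only its top-`num_sinks` most-attended positions; the memory saving comes from both budgets being far smaller than the full visual-token sequence.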
Primary Area: generative models
Submission Number: 1464