Compact GSPN: Scaling Spatial Propagation to Vision Foundation Models

08 Sept 2025 (modified: 14 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Efficient vision foundation model
Abstract: Scaling vision foundation models is limited by the quadratic cost of self-attention. Generalized Spatial Propagation Networks (GSPN) provide a linear-time alternative that propagates context directly on the 2D grid and removes positional embeddings, but they have not been scaled to foundation-level training. We present Compact GSPN (C-GSPN), a ViT block with a compressed propagation space that preserves accuracy while cutting propagation latency by nearly 10×, complemented by lightweight projections and fused CUDA kernels for further efficiency. To pretrain at scale, we use a two-stage distillation scheme with module-wise supervision and end-to-end alignment. In a representative 1K configuration (batch size 32, C = 1152), C-GSPN yields up to a 2× speedup while maintaining competitive zero-shot accuracy and improving segmentation by +2.1%. Extensive experiments and ablations confirm that the proposed compression and two-stage distillation are key to achieving strong transfer while substantially reducing compute, offering a practical path toward subquadratic vision foundation models.
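To make the idea concrete, below is a minimal, hypothetical PyTorch sketch of GSPN-style linear-time propagation with a compressed propagation space. It is not the authors' implementation or kernels: the module names, the single left-to-right scan, the gating form, and the compression ratio are all illustrative assumptions; it only shows why the cost grows linearly with grid width rather than quadratically with token count.

```python
import torch
import torch.nn as nn


class LinearScanPropagation(nn.Module):
    """Conceptual sketch of linear-time 2D propagation (not the paper's code).

    Each column is updated from its left neighbor with an input-dependent
    gate, so the scan costs O(width) per row instead of the O(N^2) pairwise
    interactions of self-attention. A hypothetical 1x1 down/up projection
    stands in for the paper's compressed propagation space.
    """

    def __init__(self, channels: int, compressed: int):
        super().__init__()
        self.down = nn.Conv2d(channels, compressed, kernel_size=1)  # compress
        self.up = nn.Conv2d(compressed, channels, kernel_size=1)    # expand back
        self.gate = nn.Conv2d(compressed, compressed, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, height, width)
        z = self.down(x)                       # propagate in the compressed space
        lam = torch.sigmoid(self.gate(z))      # per-position propagation weight
        h = torch.zeros_like(z)
        prev = torch.zeros_like(z[..., 0])     # carried state, shape (B, C', H)
        for j in range(z.shape[-1]):           # left-to-right column scan
            prev = lam[..., j] * prev + (1.0 - lam[..., j]) * z[..., j]
            h[..., j] = prev
        return x + self.up(h)                  # residual connection at full width


if __name__ == "__main__":
    block = LinearScanPropagation(channels=1152, compressed=128)
    out = block(torch.randn(2, 1152, 16, 16))
    print(out.shape)  # torch.Size([2, 1152, 16, 16])
```

In practice GSPN scans in multiple directions and fuses them, and C-GSPN implements the recurrence in fused CUDA kernels; the Python loop above is only for exposition.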
Supplementary Material: pdf
Primary Area: foundation or frontier models, including LLMs
Submission Number: 3028