Abstract: Visual understanding operates at three levels: image, patch, and pixel. Open-vocabulary semantic segmentation benefits from the power of vision-language models (VLMs), yet still struggles to transfer whole-image understanding to the pixel level. Visual tokenization compresses pixels at the patch level with marginal information loss, but the resulting visual tokens carry no semantic meaning. In this paper, we cast segmentation as pixel-level tokenization and study a unified perceptual-semantic feature compression, which uses multi-granular patch tokenization as a bridge for propagating understanding and thereby facilitates segmentation. Mirroring the cognitive process of pretrained VLMs, in which low-level features compose high-level semantics, we propose Feature Pyramid Tokenization (PAT), which clusters and quantizes multi-resolution features via a codebook into meta semantic tokens and then decodes them through two tasks: pixel reconstruction and semantic segmentation. To avoid interference between the tasks, we design loosely coupled pixel and semantic learning branches, where the former focuses on clustering visual tokens and the latter provides auxiliary semantic guidance. Our experiments show that PAT enhances the semantic intuition of the VLM feature pyramid, improves performance over the baseline segmentation model, and achieves competitive results on open-vocabulary semantic segmentation benchmarks. Our model is parameter-efficient for VLM integration and flexible enough for independent tokenization. We hope this work offers inspiration not only for improving segmentation but also for exploiting semantic visual tokens.
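To make the tokenization step concrete, below is a minimal sketch of codebook quantization over a feature pyramid with two decoding heads, written in a PyTorch style. All module names, dimensions, and the straight-through quantizer are illustrative assumptions rather than the paper's actual implementation, and the codebook/commitment losses used in VQ-VAE-style training are omitted for brevity.

```python
import torch
import torch.nn as nn

class CodebookQuantizer(nn.Module):
    """Nearest-neighbor vector quantization with a straight-through estimator.
    Codebook / commitment losses (as in VQ-VAE) are omitted for brevity."""
    def __init__(self, num_codes: int = 512, dim: int = 256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, feats: torch.Tensor):            # feats: (B, N, D)
        cb = self.codebook.weight                      # (K, D)
        # Squared L2 distance from every token to every code.
        dist = (feats.pow(2).sum(-1, keepdim=True)
                - 2 * feats @ cb.t()
                + cb.pow(2).sum(-1))                   # (B, N, K)
        idx = dist.argmin(dim=-1)                      # hard code assignment
        quant = self.codebook(idx)                     # (B, N, D)
        # Straight-through: copy gradients from quantized to input features.
        quant = feats + (quant - feats).detach()
        return quant, idx

class FeaturePyramidTokenizer(nn.Module):
    """Project multi-resolution VLM features into a shared space, quantize
    them with one codebook, and decode the tokens for two tasks."""
    def __init__(self, in_dims=(768, 768, 768), dim=256,
                 num_codes=512, num_classes=150, patch=16):
        super().__init__()
        self.proj = nn.ModuleList(nn.Linear(d, dim) for d in in_dims)
        self.quantizer = CodebookQuantizer(num_codes, dim)
        self.pixel_head = nn.Linear(dim, 3 * patch * patch)  # reconstruction branch
        self.seg_head = nn.Linear(dim, num_classes)          # segmentation branch

    def forward(self, pyramid):                        # list of (B, N_l, D_l)
        tokens = torch.cat(
            [p(f) for p, f in zip(self.proj, pyramid)], dim=1)
        quant, idx = self.quantizer(tokens)
        # Two decoding tasks; the paper's loose coupling between the pixel
        # and semantic branches is not modeled in this sketch.
        return self.pixel_head(quant), self.seg_head(quant), idx

# Toy usage with a three-level pyramid of ViT-style token features.
pyramid = [torch.randn(2, n, 768) for n in (196, 49, 16)]
model = FeaturePyramidTokenizer()
recon, logits, codes = model(pyramid)
print(recon.shape, logits.shape, codes.shape)
# torch.Size([2, 261, 768]) torch.Size([2, 261, 150]) torch.Size([2, 261])
```

Sharing one codebook across pyramid levels is the design choice that lets tokens at different granularities map to the same discrete semantic vocabulary; per-level codebooks would be an equally plausible variant.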
External IDs: dblp:conf/icann/ZhangZL25