Abstract: Highlights
• A Cross-image GroupViT is proposed for learning semantically consistent feature representation.
• A momentum-based updating method is used to learn semantically consistent features.
• Image-level and token-level supervisions are proposed for learning global and local information.
• The proposed CGViT shows superior performance on zero-shot semantic segmentation.
External IDs: dblp:journals/pr/JiangHZWL25