ESDA: Zero-shot semantic segmentation based on an embedding semantic space distribution adjustment strategy
Abstract: Highlights
• The CLIP model cannot effectively perceive pixel-level regions.
• A semantic space adjustment strategy can enable CLIP to effectively perceive regions.
• A single text [CLS] token is insufficient to guide the segmentation task.
• The vision-language embedding interactor can obtain richer semantic support.