Abstract: We introduce Patch Aligned Contrastive Learning (PACL), a modified compatibility function for CLIP’s contrastive loss that trains an alignment between the patch tokens of the vision encoder and the CLS token of the text encoder. With such an alignment, a model can identify the regions of an image that correspond to a given text input, and can therefore transfer seamlessly to the task of open vocabulary semantic segmentation without requiring any segmentation annotations during training. Using pre-trained CLIP encoders with PACL, we set the state of the art in open vocabulary zero-shot segmentation on four segmentation benchmarks: Pascal VOC, Pascal Context, COCO Stuff, and ADE20K. Furthermore, we show that PACL is also applicable to image-level predictions and, when used with a CLIP backbone, provides a general improvement in zero-shot classification accuracy over CLIP across a suite of 12 image classification datasets.
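To make the idea concrete, below is a minimal PyTorch sketch of a patch-aligned compatibility function. The function name, the softmax weighting, and the temperature value are illustrative assumptions rather than the paper's exact formulation: it scores each vision patch token against the text CLS embedding, pools the patches with weights derived from those scores, and returns one image-text compatibility per pair. At inference, the per-patch scores themselves, reshaped to the patch grid, give a coarse localisation map for the query text.

import torch
import torch.nn.functional as F

def patch_aligned_compatibility(patch_tokens, text_cls, temperature=0.07):
    """Hypothetical sketch of a PACL-style compatibility score.

    patch_tokens: (B, P, D) patch embeddings from the vision encoder
    text_cls:     (B, D)    CLS embeddings from the text encoder
    Returns:      (B,)      scalar image-text compatibility per pair
    """
    patches = F.normalize(patch_tokens, dim=-1)
    text = F.normalize(text_cls, dim=-1)

    # Cosine similarity of every patch to the text embedding: (B, P).
    # Reshaped to the patch grid, this map localises the text in the image.
    sim = torch.einsum("bpd,bd->bp", patches, text)

    # Turn similarities into attention-like weights over patches
    # (the softmax and temperature here are assumptions, not the paper's choice).
    weights = torch.softmax(sim / temperature, dim=-1)

    # Similarity-weighted pooling of the patches, then one score per pair.
    pooled = torch.einsum("bp,bpd->bd", weights, patches)
    return (pooled * text).sum(dim=-1)

# Toy usage with random features: 2 pairs, 196 patches, 512-dim embeddings.
scores = patch_aligned_compatibility(torch.randn(2, 196, 512), torch.randn(2, 512))
print(scores.shape)  # torch.Size([2])

In a contrastive training setup, such scores would stand in for CLIP's usual CLS-to-CLS cosine similarity inside the contrastive loss, which is what pushes individual patches, rather than only the global image embedding, to align with the text.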