Abstract: In this paper, we study the effect of a novel regularization scheme on contrastive language-image pre-trained (CLIP) models. Our approach is based on the observation that, in many domains, text tokens should describe only a small number of image regions and, likewise, each image region should correspond to only a few text tokens. In CLIP-style models, this implies that text-token embeddings should have high similarity to only a small number of image-patch embeddings for a given image-text pair. We formalize this observation using a novel regularization scheme that penalizes the entropy of the text-token to image-patch similarity scores. We qualitatively and quantitatively demonstrate that the proposed regularization scheme shrinks most of the pairwise text-token and image-patch similarity scores toward zero, achieving the desired sparsifying effect. We demonstrate the promise of our approach in an important medical context, chest x-rays, where the underlying sparsity hypothesis naturally arises. Using our proposed approach, we achieve state-of-the-art (SOTA) average zero-shot performance on the CheXpert and PadChest chest x-ray datasets, outperforming an unregularized version of the model and several recently published self-supervised models.
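For concreteness, here is a minimal sketch of one way such an entropy penalty on token-patch similarities could be implemented in PyTorch; the function name `entropy_regularizer`, the softmax temperature, and the loss weight `lambda_reg` are illustrative assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def entropy_regularizer(text_tokens, image_patches, temperature=0.07):
    """Entropy penalty on text-token / image-patch similarity scores.

    text_tokens:   (num_tokens, dim)  L2-normalized text-token embeddings
    image_patches: (num_patches, dim) L2-normalized image-patch embeddings
    Returns a scalar: the mean entropy of each token's similarity
    distribution over patches. Minimizing it concentrates each token's
    similarity mass on a few patches, sparsifying the similarity map.
    """
    # Pairwise cosine similarities, shape (num_tokens, num_patches)
    sims = text_tokens @ image_patches.T
    # Turn each token's similarity row into a distribution over patches
    probs = F.softmax(sims / temperature, dim=-1)
    # Shannon entropy per token; low entropy = mass on few patches
    entropy = -(probs * torch.log(probs + 1e-8)).sum(dim=-1)
    return entropy.mean()

# Added to the standard contrastive (CLIP) objective with a weight
# lambda_reg (a hypothetical hyperparameter name):
# loss = clip_loss + lambda_reg * entropy_regularizer(text_emb, patch_emb)
```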
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: We added a "Related Works" section describing related papers from which we aim to differentiate our work. We also added many tables and figures to the appendix: head-to-head comparisons among the regularized, unregularized, and CheXzero models for zero-shot classification on the CheXpert and PadChest evaluations, additional heatmap examples, additional baselines, and hyperparameter sweeps. Finally, we moved some figures from the main body to the appendix.
Assigned Action Editor: ~Zhe_Gan1
Submission Number: 687