# Captured by Captions: On Memorization and its Mitigation in CLIP Models

*Keywords:* memorization, multi-modal, clip, vision language models

*TL;DR:* We identify where the memorization happens in self-supervised learning encoders.

*Abstract:* Multi-modal models, such as CLIP, have demonstrated strong performance in aligning visual and textual representations, excelling in tasks like image retrieval and zero-shot classification. Despite this success, the mechanisms by which these models utilize training data, particularly the role of memorization, remain unclear. In uni-modal models, both supervised and self-supervised, memorization has been shown to be essential for generalization. However, it is not well understood how these findings would apply to CLIP, which incorporates elements from both supervised learning via captions that provide a supervisory signal similar to labels, and from self-supervised learning via the contrastive objective. To bridge this gap in understanding, we propose a formal definition of memorization in CLIP (CLIPMem) and use it to quantify memorization in CLIP models. Our results indicate that CLIP’s memorization behavior falls between the supervised and self-supervised paradigms, with "mis-captioned" samples exhibiting the highest levels of memorization. Additionally, we find that the text encoder contributes more to memorization than the image encoder, suggesting that mitigation strategies should focus on the text domain. Building on these insights, we propose multiple strategies to reduce memorization while at the same time improving utility, something that had not been shown before for traditional learning paradigms, where reducing memorization typically results in a utility decrease.

## Description of the code

The code mainly consists of the following files:

1. clipmem.py
This file calculates CLIPMem for the specific encoder pairs [f and g] mentioned in our paper. First install the packages in requirements.txt and download the COCO dataset. Then modify the datapath, savingpath and other parameters according to your device and experiment needs. The output is a .mat file that stores the CLIPMem results for all canary samples.

2. clip_train.py
This file trains the CLIP model on the COCO dataset. First install the packages in requirements.txt and download the COCO dataset. Then modify the datapath, savingpath and other parameters according to your device and experiment needs. The output is the trained model; no checkpoints are saved (add checkpoint saving if you need it).

3. clip_trainniose.py
This file reproduces the CLIP training with noised caption representations experiment from our paper. First install the packages in requirements.txt and download the COCO dataset. Then modify the datapath, savingpath and other parameters according to your device and experiment needs. The output is the trained model; no checkpoints are saved (add checkpoint saving if you need it).

4. clip_train_regrouping.py
This file reproduces the CLIP training with regrouping experiment from our paper. First install the packages in requirements.txt and download the COCO dataset. Then modify the datapath, savingpath and other parameters according to your device and experiment needs. The output is the trained model; no checkpoints are saved (add checkpoint saving if you need it).

5. DINO_train.py (with the 'DINO' folder)
This file trains a DINO model based on ViT-Base on the COCO dataset. First install the packages in requirements_Dino.txt and download the COCO dataset. Then make sure DINO_train.py is in the same folder as the model files from the DINO folder. Finally, modify the datapath, savingpath and other parameters according to your device and experiment needs. The output is the trained model; no checkpoints are saved (add checkpoint saving if you need it).

6. VIT_sl.py
This file trains the ViT-Base model with a supervised multi-label classifier (one fully connected layer) on the COCO dataset. First install the packages in requirements_Dino.txt and download the Coco Dataset for Multi-label Image Classification (available here: https://www.kaggle.com/datasets/shubham2703/coco-dataset-for-multi-label-image-classification/data). Then modify the datapath, savingpath and other parameters according to your device and experiment needs. The output is the trained model; no checkpoints are saved (add checkpoint saving if you need it).

7. vittiny_poison.py
This file trains the ViT-tiny model with a supervised classifier on the CIFAR10 dataset. First install the packages in requirements.txt and download the CIFAR10 dataset. Then modify the datapath, savingpath and other parameters according to your device and experiment needs. The code poisons the first 200 samples of the CIFAR10 training set (you can change this number as needed). The output is the trained model; no checkpoints are saved (add checkpoint saving if you need it).

8. UnitMem.py
This code comes from the work: Wang, Wenhao, et al. "Localizing Memorization in SSL Vision Encoders." Accepted as a conference paper at NeurIPS 2024.
It was obtained from https://github.com/sprintml/SSLLocalizeMemorization
The code has been modified to work with CLIP encoders. Place this file in the same folder as model_XXX.py and model.pt. Also modify the datapath, savingpath, model name, and other parameters before use. The default setting for UnitMem is based on the augmentation set used during CLIP training (i.e. RandomResizedCrop(224)). If other augmentations were used during training, add them to the augmentation set as well.

9. image.py
This file uses stable-diffusion-v1.5 to generate images from the image captions of the COCO dataset. First install the packages in requirements.txt and download the COCO dataset. Then modify the datapath, savingpath and other parameters according to your device and experiment needs.
If you are on a Windows device, be sure to enable symlinks. Here is a guide: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development

10. caption.py
This file uses GPT3.5-turbo to generate captions from a sample caption and image embeddings (generated by our CLIP model) of the COCO dataset. First install the packages in requirements.txt and download the COCO dataset. Then modify the datapath, savingpath and other parameters according to your device and experiment needs.
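
For orientation on what clipmem.py computes from the [f and g] encoder pairs, here is a minimal sketch. It is not the repository's implementation. It assumes that f was trained with the sample and g without it, and that alignment is the cosine similarity between a sample's image and text embeddings. The formal CLIPMem definition in the paper may include normalization or additional terms. The function names `cosine` and `alignment_gap` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_u * norm_v)


def alignment_gap(img_f, txt_f, img_g, txt_g):
    """Image-text alignment under f minus alignment under g.

    Assumption: f saw the sample during training and g did not, so a
    large positive gap suggests the pair is aligned mainly because f
    was trained on it.
    """
    return cosine(img_f, txt_f) - cosine(img_g, txt_g)


# Toy embeddings: f aligns the pair perfectly, g not at all.
gap = alignment_gap([1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
```

In practice the four embeddings would come from the image and text encoders of the two trained models, evaluated on the same image-caption pair.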

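Several training scripts above do not save checkpoints. The following is a minimal sketch of one way to add periodic checkpoint saving. The helper name `save_checkpoint`, the file naming scheme, and the `every` parameter are assumptions and do not exist in the repository. The default serializer uses `pickle` only so that the sketch runs without PyTorch.

```python
import os
import pickle
import tempfile


def _pickle_save(obj, path):
    """Stand-in serializer so the sketch runs without PyTorch."""
    with open(path, "wb") as fh:
        pickle.dump(obj, fh)


def save_checkpoint(state, save_dir, epoch, every=1, save_fn=_pickle_save):
    """Save `state` after every `every`-th epoch (epochs are 0-indexed).

    Returns the path that was written, or None if the epoch was skipped.
    """
    if every < 1:
        raise ValueError("every must be >= 1")
    if (epoch + 1) % every != 0:
        return None
    os.makedirs(save_dir, exist_ok=True)
    final_path = os.path.join(save_dir, f"checkpoint_epoch{epoch + 1:04d}.pt")
    tmp_path = final_path + ".tmp"
    save_fn(state, tmp_path)           # write to a temporary file first
    os.replace(tmp_path, final_path)   # then rename, so a crash leaves no partial file
    return final_path


# Toy run: four epochs, saving after every second one.
with tempfile.TemporaryDirectory() as demo_dir:
    written = [save_checkpoint({"epoch": e}, demo_dir, e, every=2) for e in range(4)]
    saved_names = [os.path.basename(p) for p in written if p is not None]
```

Inside a PyTorch training loop, one would typically pass `save_fn=torch.save` and a state such as `{"epoch": epoch, "model": model.state_dict(), "optimizer": optimizer.state_dict()}`, where `model` and `optimizer` stand for whatever the script names them.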

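The noised caption representation experiment can be pictured with a small sketch. This README does not state how the noise is injected, so the sketch assumes zero-mean Gaussian noise added element-wise to a caption (text) embedding before the contrastive loss is computed. The function name `add_gaussian_noise` and the default `sigma` are hypothetical.

```python
import random


def add_gaussian_noise(embedding, sigma=0.1, rng=None):
    """Return a copy of `embedding` with N(0, sigma^2) noise added per element.

    Assumption: noise is applied to the text embedding only; the input
    list is left unchanged.
    """
    if sigma < 0:
        raise ValueError("sigma must be non-negative")
    rng = rng or random.Random()
    return [x + rng.gauss(0.0, sigma) for x in embedding]


# Toy caption embedding; a seeded generator keeps the run reproducible.
caption_embedding = [0.2, -0.5, 0.7, 0.1]
noisy_embedding = add_gaussian_noise(caption_embedding, sigma=0.05, rng=random.Random(0))
```

In a real training script the same idea would apply to the batch of text-encoder outputs, with the noise scale treated as a hyperparameter.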

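For vittiny_poison.py, the README says only that the first 200 training samples are poisoned, not what form the poisoning takes. The sketch below assumes label flipping purely for illustration: each of the first `n` labels is replaced with a different, randomly chosen class. The function name `poison_first_n` and its parameters are hypothetical.

```python
import random


def poison_first_n(labels, n=200, num_classes=10, seed=0):
    """Return a copy of `labels` with the first `n` entries flipped.

    Assumption: "poisoning" means assigning a wrong class label; the
    script in this repository may do something different.
    """
    rng = random.Random(seed)
    poisoned = list(labels)
    for i in range(min(n, len(poisoned))):
        wrong_classes = [c for c in range(num_classes) if c != poisoned[i]]
        poisoned[i] = rng.choice(wrong_classes)
    return poisoned


# Toy stand-in for the CIFAR10 training labels (10 classes).
clean_labels = [i % 10 for i in range(1000)]
poisoned_labels = poison_first_n(clean_labels, n=200)
```

Changing `n` corresponds to changing the number of poisoned samples mentioned above.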

