Abstract: Large vision-language models (VLMs) have been shown to learn rich joint image-text representations, enabling strong performance on a range of downstream tasks. However, they demonstrate little quantitative understanding of objects and lack a good counting-aware representation. This paper conducts a reproducibility study of ‘Teaching CLIP to Count to Ten’ (Paiss et al., 2023), which presents a method to finetune a CLIP model (Radford et al., 2021) by introducing a counting-contrastive loss term, improving zero-shot counting accuracy while maintaining zero-shot classification performance. We build on the existing method by improving the model’s performance using a smaller subset of the original training data and lower computational resources. We verify the original claims by reproducing the study with our own open-source code. The implementation can be found at https://anonymous.4open.science/r/CountCLIP-FA07.
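To make the counting-contrastive objective mentioned above concrete, the sketch below shows one way such a loss can be written: the image embedding is pushed toward its true-count caption and away from a counterfactual caption whose number word has been swapped. The function name, arguments, temperature value, and the two-way cross-entropy formulation are illustrative assumptions, not the authors' exact implementation.

```python
# Hypothetical sketch of a counting-contrastive loss for CLIP finetuning.
# All names and defaults here are illustrative, not taken from the paper's code.
import torch
import torch.nn.functional as F


def counting_contrastive_loss(image_emb, true_text_emb, counterfactual_text_emb,
                              temperature=0.07):
    """Encourage each image to be closer to its true-count caption than to a
    counterfactual caption with a different count (assumed formulation)."""
    # Normalize so dot products are cosine similarities, as in CLIP.
    image_emb = F.normalize(image_emb, dim=-1)
    true_text_emb = F.normalize(true_text_emb, dim=-1)
    counterfactual_text_emb = F.normalize(counterfactual_text_emb, dim=-1)

    sim_true = (image_emb * true_text_emb).sum(dim=-1) / temperature          # (B,)
    sim_cf = (image_emb * counterfactual_text_emb).sum(dim=-1) / temperature  # (B,)

    # Two-way softmax over {true caption, counterfactual caption};
    # the target is always the true caption (index 0).
    logits = torch.stack([sim_true, sim_cf], dim=-1)                          # (B, 2)
    targets = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, targets)
```

In practice this term would be added to the standard CLIP image-text contrastive loss with a weighting coefficient, so that counting supervision is learned without sacrificing zero-shot classification; the exact weighting used in the paper is not reproduced here.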
Submission Length: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=rXUN8SKaAv
Changes Since Last Submission: We have anonymized and removed all public instances of our repository and have made revisions to the paper clarifying the intent of our work.
Assigned Action Editor: ~Kamalika_Chaudhuri1
Submission Number: 2582