Abstract: We focus on addressing the object counting limitations of vision-language models, with a particular emphasis on Contrastive Language-Image Pre-training (CLIP) models. Centered on our hypothesis that counting knowledge can be abstracted into linear vectors within the text embedding space, we develop a parameter-efficient fine-tuning method and several zero-shot methods to improve CLIP's counting accuracy. Through comprehensive experiments, we demonstrate that our learning-based method not only outperforms full-model fine-tuning in counting accuracy but also retains the broad capabilities of pre-trained CLIP models. Our zero-shot text-embedding editing techniques are also effective in situations where training data is scarce, and can be extended to improve Stable Diffusion's ability to generate images with precise object counts. We also contribute two specialized datasets to train and evaluate CLIP's counting capabilities.
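As a rough illustration of the abstract's central idea, the sketch below shows what editing a CLIP text embedding along a linear "counting" direction might look like. It is a minimal, hypothetical example only: the specific prompts, the difference-of-prompts proxy for the counting vector, and the 0.5 scaling factor are illustrative assumptions, not the paper's actual method for obtaining or applying such vectors.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Load a standard pre-trained CLIP model (assumed checkpoint for illustration).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def text_embedding(prompt: str) -> torch.Tensor:
    """Return the L2-normalized CLIP text embedding for a single prompt."""
    inputs = processor(text=[prompt], return_tensors="pt", padding=True)
    with torch.no_grad():
        emb = model.get_text_features(**inputs)
    return emb / emb.norm(dim=-1, keepdim=True)

# Crude proxy for a linear "counting direction": the difference between
# embeddings of prompts that mention different object counts.
count_direction = (
    text_embedding("a photo of three apples")
    - text_embedding("a photo of one apple")
)

# Zero-shot-style edit: shift a base prompt embedding along that direction,
# then re-normalize before using it for retrieval or scoring.
edited = text_embedding("a photo of one apple") + 0.5 * count_direction
edited = edited / edited.norm(dim=-1, keepdim=True)
```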
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission: Addressed comments from three reviewers by adding experiments, comparisons, and discussions. The main changes are highlighted in the text, and new tables and figures have been added for the new experiments.
Assigned Action Editor: ~Massimiliano_Mancini1
Submission Number: 3239