Measuring Human-CLIP Alignment at Different Abstraction Levels

Published: 02 Mar 2024, Last Modified: 05 May 2024 | ICLR 2024 Workshop Re-Align Poster | Readers: Everyone | CC BY 4.0
Track: short paper (up to 5 pages)
Keywords: CLIP, human-alignment, abstraction levels, different complexities
TL;DR: Using behavioural data at different abstraction levels, we measure depth-wise human-CLIP alignment across different architecture designs and training procedures.
Abstract: Measuring the human alignment of trained models is gaining traction because it is not clear to what extent artificial image representations are proper models of the visual brain. Using the CLIP model and some of its variants as a case study, we show the importance of running experiments at different abstraction levels, because the differences between the images being compared can themselves be of lower or higher abstraction. This lets us draw richer conclusions about the models and reveals some interesting phenomena that arise when the models are analyzed depth-wise. Our analysis identifies the size of the patches into which the image is divided as the most important factor in achieving high human alignment at all abstraction levels. We also find that the method used to compute the model distances is crucial to avoid alignment drops. Moreover, replacing the usual softmax activation with a sigmoid increases human alignment at all abstraction levels, especially in the last model layers. Surprisingly, training the model with Chinese captions or medical data gives more human-aligned models, but only at low abstraction levels.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 36
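
The depth-wise alignment measurement described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that, for each layer, feature vectors have already been extracted for a set of image pairs, that human judgments are given as one score per pair with higher values meaning a larger perceived difference, and that alignment is summarized by the Spearman rank correlation between model distances and human scores. All function names (`euclidean`, `cosine_distance`, `layerwise_alignment`, and so on) are hypothetical, and the choice of distance is left pluggable because the abstract notes that this choice matters.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def euclidean(u: Vector, v: Vector) -> float:
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_distance(u: Vector, v: Vector) -> float:
    """One minus cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; assumes neither input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation computed as Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def layerwise_alignment(
    ref_feats: Sequence[Sequence[Vector]],
    cmp_feats: Sequence[Sequence[Vector]],
    human_scores: Sequence[float],
    distance: Callable[[Vector, Vector], float] = euclidean,
) -> List[float]:
    """Return one alignment value per layer.

    ref_feats[l][i] and cmp_feats[l][i] are the layer-l features of the two
    images in pair i; human_scores[i] is the human judgment for that pair.
    """
    alignment = []
    for layer_ref, layer_cmp in zip(ref_feats, cmp_feats):
        model_dists = [distance(u, v) for u, v in zip(layer_ref, layer_cmp)]
        alignment.append(spearman(model_dists, human_scores))
    return alignment


# Toy data: 2 layers, 3 image pairs, 2-dimensional features (made up).
toy_ref = [[[0, 0], [0, 0], [0, 0]], [[1, 0], [1, 0], [1, 0]]]
toy_cmp = [[[1, 0], [2, 0], [3, 0]], [[1, 3], [1, 2], [1, 1]]]
toy_human = [0.1, 0.5, 0.9]
toy_alignment = layerwise_alignment(toy_ref, toy_cmp, toy_human)
```

In this toy example the first layer orders the pairs the same way as the human scores and the second layer orders them in reverse, so the two alignment values have opposite signs. Repeating the computation with human data collected at different abstraction levels, and with a different `distance` function, would give one alignment curve over depth per setting.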