Abstract: In this paper, we propose the hypothesis that CLIP disentangles compositional visual attributes into mutually orthogonal, independent subspaces, from which it builds compositional representations of images. This hypothesis suggests that CLIP's compositional mechanisms resemble those used by humans. We identify five core compositional attributes predicted by the hypothesis: color, size, counting, camera view, and pattern. We empirically test their properties and find that the corresponding subspaces code for their respective attribute types and are essentially orthogonal both to one another and to the subject of the image.
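The orthogonality claim in the abstract can be probed by measuring pairwise cosine similarities between attribute direction vectors in CLIP's embedding space. The sketch below illustrates that test procedure only; it uses random placeholder vectors rather than real CLIP embeddings (which would come from a CLIP model, e.g. as differences of text-prompt embeddings), and the attribute names and dimensionality are taken from the abstract and from CLIP ViT-B/32 respectively.

```python
import numpy as np

# Hypothetical attribute direction vectors. Real vectors would be extracted
# from a CLIP model; random placeholders stand in for them here.
rng = np.random.default_rng(0)
dim = 512  # embedding dimension of CLIP ViT-B/32
attributes = ["color", "size", "counting", "camera view", "pattern"]
directions = {a: rng.normal(size=dim) for a in attributes}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Near-orthogonality test: every pairwise |cosine| should be close to 0.
for i, a in enumerate(attributes):
    for b in attributes[i + 1:]:
        print(f"{a} vs {b}: cos = {cosine(directions[a], directions[b]):+.3f}")
```

In high dimensions, even random directions are nearly orthogonal, so a meaningful version of this test would compare the measured similarities against such a random baseline.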