PHyCLIP: $\ell_1$-Product of Hyperbolic Factors Unifies Hierarchy and Compositionality in Vision-Language Representation Learning

ICLR 2026 Conference Submission4666 Authors

Published: 26 Jan 2026, Last Modified: 26 Jan 2026 · ICLR 2026 · CC BY 4.0
Keywords: Vision-language representation learning, compositionality, Boolean algebra, hyperbolic embedding
TL;DR: For multi-modal representation learning, we propose an $\ell_1$-product metric space of hyperbolic spaces, which simultaneously captures hierarchy and compositionality.
Abstract: Vision-language models have achieved remarkable success in multi-modal representation learning from large-scale pairs of visual scenes and linguistic descriptions. However, they still struggle to simultaneously express two distinct types of semantic structures: the hierarchy within a concept family (e.g., *dog* $\preceq$ *mammal* $\preceq$ *animal*) and the compositionality across different concept families (e.g., "a dog in a car" $\preceq$ *dog*, *car*). Recent works have addressed this challenge by employing hyperbolic space, which efficiently captures tree-like hierarchy, yet its suitability for representing compositionality remains unclear. To resolve this dilemma, we propose *PHyCLIP*, which employs an $\ell_1$-*P*roduct metric on a Cartesian product of *Hy*perbolic factors. With our design, intra-family hierarchies emerge within individual hyperbolic factors, and cross-family composition is captured by the $\ell_1$-product metric, analogous to a Boolean algebra. Experiments on zero-shot classification, retrieval, hierarchical classification, and compositional understanding tasks demonstrate that PHyCLIP outperforms existing single-space approaches and offers more interpretable structures in the embedding space.
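The core distance described in the abstract — an $\ell_1$-sum of per-factor hyperbolic distances — can be sketched numerically. The following is a minimal illustration, not the authors' implementation: it assumes each factor is a Poincaré ball (one standard model of hyperbolic space), and the function names (`poincare_dist`, `l1_product_dist`) are hypothetical.

```python
import numpy as np

def poincare_dist(u, v, eps=1e-12):
    """Geodesic distance between points u, v inside the unit Poincare ball:
    d(u, v) = arccosh(1 + 2 * ||u - v||^2 / ((1 - ||u||^2) * (1 - ||v||^2)))."""
    sq_diff = np.sum((u - v) ** 2)
    denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    return np.arccosh(1.0 + 2.0 * sq_diff / max(denom, eps))

def l1_product_dist(x, y):
    """l1-product metric on a Cartesian product of hyperbolic factors:
    each element of x / y is the coordinate block for one factor, and the
    total distance is the sum of the factor-wise hyperbolic distances."""
    return sum(poincare_dist(xi, yi) for xi, yi in zip(x, y))
```

Under this sketch, intra-family structure lives inside each ball (hierarchy via distance from the origin), while the $\ell_1$ sum lets factors contribute independently, which is what the paper relates to Boolean-algebra-like composition across concept families.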
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 4666