Hierarchy Matters: Learning Vision–Language Representations in Hyperbolic Space
Keywords: Vision–Language Models, Hyperbolic Representation Learning, Geometry-Aware Embeddings, Hierarchical Representation Learning, Multimodal Learning, Compositional Generalization, Structure-Aware Modeling, Poincaré Geometry, Cross-Modal Retrieval
Abstract: Vision–language models (VLMs) have achieved remarkable performance by aligning images and text in a shared Euclidean space. However, Euclidean embeddings struggle to capture hierarchical structures inherent in multimodal data, such as conceptual taxonomies or fine-grained categories. We propose HVL (Hyperbolic Vision–Language), a geometry-grounded framework that maps images and text into a Poincaré manifold to induce hierarchy-aware representations. HVL leverages the exponential capacity of hyperbolic space to preserve semantic distances across multiple levels of abstraction, while an adaptive, entropy-driven entailment loss enforces hierarchical ordering between modalities. Extensive experiments on zero-shot classification and image–text retrieval benchmarks demonstrate that HVL consistently outperforms Euclidean baselines (e.g., CLIP) and Lorentz-based hyperbolic models (e.g., MERU), particularly in scenarios requiring fine-grained hierarchical understanding. These results highlight the importance of respecting geometric structure and establish hyperbolic embeddings as a principled foundation for hierarchical multimodal representation learning.
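Illustrative sketch (not the authors' implementation): the abstract does not say how HVL projects encoder outputs onto the Poincaré manifold. One common choice, assumed here, is the exponential map at the origin of a Poincaré ball with curvature -c, paired with the standard closed-form geodesic distance. The function names `exp_map_origin` and `poincare_distance`, the curvature parameter `c`, and the toy feature vectors are hypothetical.

```python
import math

def exp_map_origin(v, c=1.0):
    """Map a Euclidean (tangent) vector v at the origin into the
    Poincare ball of curvature -c (assumed projection, not from the paper)."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return [0.0 for _ in v]
    sqrt_c = math.sqrt(c)
    # tanh keeps the result strictly inside the ball of radius 1/sqrt(c)
    scale = math.tanh(sqrt_c * norm) / (sqrt_c * norm)
    return [scale * x for x in v]

def poincare_distance(x, y, c=1.0):
    """Geodesic distance between two points inside the Poincare ball."""
    sq_diff = sum((a - b) ** 2 for a, b in zip(x, y))
    sq_x = sum(a * a for a in x)
    sq_y = sum(b * b for b in y)
    arg = 1.0 + 2.0 * c * sq_diff / ((1.0 - c * sq_x) * (1.0 - c * sq_y))
    # Clamp guards against rounding slightly below 1 for near-identical points
    return math.acosh(max(arg, 1.0)) / math.sqrt(c)

# Hypothetical encoder outputs standing in for an image and a caption
image_feat = exp_map_origin([0.3, 0.4])
text_feat = exp_map_origin([0.1, 0.2])
print(poincare_distance(image_feat, text_feat))
```

Under this map, the distance from the origin to `exp_map_origin(v)` is twice the Euclidean norm of `v`, so embeddings with larger pre-projection norms sit farther from the origin, where hyperbolic volume grows exponentially. This is the property that hierarchy-aware objectives typically exploit, for example by placing generic concepts near the origin and specific ones near the boundary.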
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 24