GHVL: Geometry-Grounded Hyperbolic Vision-Language Models for Hierarchical Multimodal Representation Learning
Track: long paper (up to 8 pages)
Keywords: vision-language models; hyperbolic embeddings; hyperbolic geometry; hierarchical representation learning; hierarchy-aware representations; multimodal contrastive learning; geometric deep learning; geometric inductive bias; non-Euclidean representation learning
Abstract: Vision-language models (VLMs) have achieved remarkable performance by aligning visual and textual representations in a shared Euclidean space. However, Euclidean representations inherently fail to capture the hierarchical semantic structure present in multimodal data, such as fine-grained categories and conceptual taxonomies. We propose GHVL, a geometry-grounded hyperbolic VLM that maps images and text into the Poincaré ball to induce hierarchy-aware representations. By leveraging the exponentially growing volume of hyperbolic space, GHVL preserves semantic distances across multiple hierarchy levels, enabling faithful modeling of fine-grained concepts. We introduce an adaptive, entropy-driven entailment loss that enforces hierarchical ordering between modalities and integrate it with contrastive objectives for cross-modal alignment. Evaluations on zero-shot classification and image-text retrieval benchmarks demonstrate consistent improvements over the Euclidean baseline CLIP and the Lorentz-based hyperbolic baseline MERU, particularly in hierarchy-sensitive scenarios. These results highlight the importance of respecting geometric structure and show that hyperbolic representations provide a principled foundation for hierarchical multimodal understanding.
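The abstract's core geometric ingredients can be illustrated concretely. The sketch below is not GHVL's implementation (the submission does not include code); it is a minimal NumPy sketch, under standard conventions, of the two Poincaré-ball operations the abstract relies on: the exponential map at the origin (projecting Euclidean encoder outputs into the ball) and the geodesic distance used for cross-modal alignment. Function names and the unit-curvature choice are illustrative assumptions.

```python
import numpy as np

def exp_map0(v, eps=1e-7):
    """Exponential map at the origin of the unit Poincare ball (curvature -1).

    Maps a Euclidean tangent vector v (e.g. an encoder output) to a point
    strictly inside the unit ball: exp_0(v) = tanh(||v||) * v / ||v||.
    """
    norm = np.maximum(np.linalg.norm(v, axis=-1, keepdims=True), eps)
    return np.tanh(norm) * v / norm

def poincare_dist(x, y, eps=1e-7):
    """Geodesic distance between points x, y inside the unit Poincare ball:

    d(x, y) = arccosh(1 + 2||x - y||^2 / ((1 - ||x||^2)(1 - ||y||^2))).
    """
    sq = np.sum((x - y) ** 2, axis=-1)
    denom = (1.0 - np.sum(x * x, axis=-1)) * (1.0 - np.sum(y * y, axis=-1))
    return np.arccosh(1.0 + 2.0 * sq / np.maximum(denom, eps))

# Hypothetical usage: embed an image feature and a text feature, then compare.
img = exp_map0(np.array([0.5, 0.1]))
txt = exp_map0(np.array([0.4, 0.2]))
d = poincare_dist(img, txt)  # scalar geodesic distance, 0 iff points coincide
```

Because distances near the ball's boundary grow without bound, generic points placed deeper toward the boundary sit "below" more abstract points near the origin, which is the property the entailment loss exploits to order modalities hierarchically.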
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 43