GHVL: Geometry-Grounded Hyperbolic Vision-Language Models for Hierarchical Multimodal Representation Learning
Track: long paper (up to 8 pages)
Keywords: vision-language models; hyperbolic embeddings; hyperbolic geometry; hierarchical representation learning; hierarchy-aware representations; multimodal contrastive learning; geometric deep learning; geometric inductive bias; non-Euclidean representation learning
Abstract: Vision-language models (VLMs) have achieved remarkable performance by aligning visual and textual representations in a shared Euclidean space. However, Euclidean representations inherently fail to capture the hierarchical semantic structure present in multimodal data, such as fine-grained categories and conceptual taxonomies. We propose GHVL, a geometry-grounded hyperbolic VLM that maps images and text into the Poincaré ball to induce hierarchy-aware representations. By leveraging the exponentially growing volume of hyperbolic space, GHVL preserves semantic distances across multiple hierarchy levels, enabling faithful modeling of fine-grained concepts. We introduce an adaptive, entropy-driven entailment loss that enforces hierarchical ordering between modalities and integrate it with contrastive objectives for cross-modal alignment. Evaluations on zero-shot classification and image-text retrieval benchmarks demonstrate consistent improvements over the Euclidean baseline CLIP and the Lorentz-based hyperbolic baseline MERU, particularly in hierarchy-sensitive scenarios. These results highlight the importance of respecting geometric structure and show that hyperbolic representations provide a principled foundation for hierarchical multimodal understanding.
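The abstract's core geometric ingredients can be illustrated concretely. The sketch below is not GHVL's implementation (the submission does not include code); it is a minimal NumPy sketch, under standard conventions, of the two Poincaré-ball operations the abstract relies on: the exponential map at the origin (projecting Euclidean encoder outputs into the ball) and the geodesic distance used for cross-modal alignment. Function names and the unit-curvature choice are illustrative assumptions.

```python
import numpy as np

def exp_map0(v, eps=1e-7):
    """Exponential map at the origin of the unit Poincare ball (curvature -1).

    Maps a Euclidean tangent vector v (e.g. an encoder output) to a point
    strictly inside the unit ball: exp_0(v) = tanh(||v||) * v / ||v||.
    """
    norm = np.maximum(np.linalg.norm(v, axis=-1, keepdims=True), eps)
    return np.tanh(norm) * v / norm

def poincare_dist(x, y, eps=1e-7):
    """Geodesic distance between points x, y inside the unit Poincare ball:

    d(x, y) = arccosh(1 + 2||x - y||^2 / ((1 - ||x||^2)(1 - ||y||^2))).
    """
    sq = np.sum((x - y) ** 2, axis=-1)
    denom = (1.0 - np.sum(x * x, axis=-1)) * (1.0 - np.sum(y * y, axis=-1))
    return np.arccosh(1.0 + 2.0 * sq / np.maximum(denom, eps))

# Hypothetical usage: embed an image feature and a text feature, then compare.
img = exp_map0(np.array([0.5, 0.1]))
txt = exp_map0(np.array([0.4, 0.2]))
d = poincare_dist(img, txt)  # scalar geodesic distance, 0 iff points coincide
```

Because distances near the ball's boundary grow without bound, generic points placed deeper toward the boundary sit "below" more abstract points near the origin, which is the property the entailment loss exploits to order modalities hierarchically.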
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 43