Hyperbolic Image-Text Representations

Karan Desai; Maximilian Nickel; Tanmay Rajpurohit; Justin Johnson; Shanmukha Ramakrishna Vedantam

Published: 06 Mar 2023, Last Modified: 18 May 2025

MRL 2023 Spotlight

Readers: Everyone

Keywords: vision and language, representation learning, riemannian geometry, transformers

TL;DR: We propose MERU, a contrastive model that yields hyperbolic representations of images and text.

Abstract: Visual and linguistic concepts naturally organize themselves in a hierarchy, where a textual concept "dog" entails all images that contain dogs. Although this hierarchy is intuitive, current large-scale vision and language models such as CLIP do not explicitly capture it. We propose MERU, a contrastive model that yields hyperbolic representations of images and text. Hyperbolic spaces have suitable geometric properties to embed tree-like data, so MERU can better capture the underlying hierarchy in image-text data. Our results show that MERU learns a highly interpretable representation space while being competitive with CLIP's performance on multi-modal tasks like image classification and image-text retrieval.
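To illustrate why hyperbolic space suits tree-like data, the sketch below computes geodesic distance in the unit Poincaré ball, one standard model of hyperbolic space with curvature -1. This is a generic illustration, not MERU's implementation: the abstract does not state which hyperbolic model or curvature the method uses, and the function name `poincare_distance` and the sample points are hypothetical.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit Poincare ball.

    Assumption: curvature is fixed at -1. Points are plain sequences of floats.
    """
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    # d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v))
    return math.acosh(arg)


# Two pairs of points with the same Euclidean gap (0.1) at different radii.
near_origin = poincare_distance((0.0, 0.0), (0.1, 0.0))
near_boundary = poincare_distance((0.85, 0.0), (0.95, 0.0))
```

The pair near the boundary is much farther apart in hyperbolic terms than the pair near the origin, even though both pairs are 0.1 apart in Euclidean terms. This growth of space away from the origin is what leaves room for the many leaves of a hierarchy, while generic concepts can sit closer to the origin.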
