Keywords: Visual-Semantic Embeddings, Hyperbolic Embeddings
Abstract: Visual Semantic Embeddings (VSE) map images and text into a shared latent space, serving as a core technique for multi-modal applications. While hyperbolic VSE models effectively capture hierarchical data, they often process text as a simple Bag-of-Words, leading to a lack of compositional understanding. Building on recent findings that large language models implicitly acquire syntactic structure, we propose a method to enhance VSE by explicitly learning the syntactic structure of text. We introduce a novel regularization term that preserves parent-child relations from dependency syntax trees as entailment relations within hyperbolic space. Experiments demonstrate that our method outperforms baselines not
only on the VL-CheckList benchmark for compositional understanding but also on standard zero-shot tasks. These results confirm that explicitly incorporating syntactic information improves the compositional capabilities of VSE models.
Submission Number: 27