Keywords: Visual-Semantic Embeddings, Hyperbolic Embeddings
Abstract: Visual Semantic Embeddings (VSE) map images and text into a shared latent space, serving as a core technique for multi-modal applications. While hyperbolic VSE models effectively capture hierarchical data, they often process text as a simple Bag-of-Words, leading to a lack of compositional understanding. Building on recent findings that large language models implicitly acquire syntactic structure, we propose a method to enhance VSE by explicitly learning the syntactic structure of text. We introduce a novel regularization term that preserves parent-child relations from dependency syntax trees as entailment relations within hyperbolic space. Experiments demonstrate that our method outperforms baselines not
only on the VL-CheckList benchmark for compositional understanding but also on standard zero-shot tasks. These results confirm that explicitly incorporating syntactic information improves the compositional capabilities of VSE models.
Submission Number: 27