Hyperbolic Visual-Semantic Alignment for Structural Visual Recognition

18 Sept 2023 (modified: 22 Feb 2024), ICLR 2024 Conference Withdrawn Submission
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: visual recognition, semantic segmentation, semantic hierarchy
TL;DR: Our main idea is to learn shared representations of images and semantic concepts in the hyperbolic space.
Abstract: Visual and semantic concepts are inherently organized in a hierarchy, where a higher-level textual concept (e.g., Animal) entails all images containing a lower-level one (e.g., Cat). Although this hierarchy is intuitive, conventional visual recognition systems establish only single-level correspondence between images and semantic concepts and do not explicitly capture the hierarchical relationships among them. We present Hyperbolic Visual-Semantic Alignment (HVSA), which probes multi-level semantic information, from fine-grained to fully abstract, within the tree-shaped hierarchy to realize structural visual recognition. Our main idea is to learn shared representations of images and semantic concepts in the hyperbolic space. Hyperbolic spaces have geometric properties well suited to embedding tree-like data structures, and thus help capture the underlying hierarchy. Although structurally aligning the two modalities is challenging, we achieve it through a joint optimization process guided by two primary objectives. First, we propose hierarchy-agnostic visual-semantic alignment, which leverages a Gaussian mixture VAE to establish a “flat” representation space shared by both modalities. Second, we introduce hierarchy-aware semantic learning, which cultivates a “hierarchical” feature space for semantic concepts solely through hyperbolic metric learning. These two objectives operate at different granularities and jointly contribute to the hierarchical alignment of visual-semantic features, ultimately enhancing structural image understanding. HVSA shows high efficacy and generality, as evidenced by notable performance improvements across six datasets, for both image-level (i.e., ImCLEF07A, ImCLEF07D, and tieredImageNet-H) and pixel-level (i.e., Cityscapes, LIP, and PASCAL-Person-Part) visual recognition. Our code will be released.
Submission Number: 1310
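
The abstract's claim that hyperbolic space suits tree-like data can be illustrated with a small distance computation. The sketch below is not the authors' implementation: the abstract does not say which model of hyperbolic space HVSA uses, so the Poincaré ball is assumed here, and the function name `poincare_distance` and the three 2-D embeddings are hypothetical values chosen only for illustration.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit Poincare ball.

    Assumption: curvature -1 and the standard closed form
    d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2))).
    """
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    if sq_u >= 1.0 or sq_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v))
    return math.acosh(arg)


# Hypothetical embeddings: an abstract concept near the origin and two
# fine-grained children placed closer to the boundary of the ball.
animal = (0.0, 0.0)
cat = (0.7, 0.0)
dog = (0.0, 0.7)

d_parent = poincare_distance(animal, cat)   # parent to child
d_siblings = poincare_distance(cat, dog)    # child to child

# How close the sibling distance is to the path through the parent,
# in hyperbolic versus Euclidean geometry.
hyperbolic_ratio = d_siblings / (2.0 * d_parent)
euclidean_ratio = math.dist(cat, dog) / (2.0 * math.dist(animal, cat))
```

With these points, the sibling distance is a larger fraction of the path through the parent under the hyperbolic metric than under the Euclidean one, and the fraction grows as the children move toward the boundary. This is the tree-like behaviour that a hierarchy-aware metric-learning objective could exploit.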