Keywords: Hierarchical Image Tagging, Hyperbolic Space, Vision-Language Modeling
Abstract: Image tagging, also known as multi-label image recognition, aims to assign multiple semantic labels to a given image. However, few benchmarks have been tailored for *hierarchical image tagging*, which measures hierarchical classification accuracy, where a concept such as ‘Shiba Inu’ should be recognized as both ‘dog’ and ‘animal’. To explicitly capture such hierarchy, we introduce a hierarchical image tagging benchmark, termed HiTag, to evaluate multi-label visual recognition from a hierarchical perspective. Specifically, we first construct a tree-like hierarchical structure for the tags based on lexical semantic databases, *i.e.*, WordNet and YAGO, comprising *10 levels* and *3,334 labels*. The hierarchy is aligned with visual perception through optimization by a large model, and can be dynamically extended to unexplored tags by locating their positions in WordNet and assessing their validity with a large model. With the designed hierarchical structure, we use large language models to annotate 2,872,012 images from CC3M as training data and manually tag 57,223 images from OpenImage as test data, to advance the exploration of the hierarchical image tagging task. Meanwhile, we develop a pipeline to assess the hierarchical classification capacity of models at multiple levels, with metrics including tree edit distance, Jaccard similarity, hierarchical precision, and hierarchical recall, *etc.* Furthermore, we embed hierarchical tags, images, and captions into hyperbolic space for modeling, leveraging its inherent suitability for representing tree-structured data. Experimental results on HiTag confirm that our method not only demonstrates superior performance in zero-shot image tagging, but also achieves state-of-the-art results on hierarchical image tagging. We will release the code and the dataset to support the community.
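As a minimal illustration of the hierarchical metrics mentioned above, the following sketch computes hierarchical precision and recall in the standard way (augmenting each label set with its ancestors before comparison). The `parent` map, function names, and toy hierarchy are illustrative assumptions, not the benchmark's actual implementation.

```python
def ancestor_closure(labels, parent):
    """Expand a label set with all of its ancestors in the tag hierarchy."""
    closed = set()
    for lab in labels:
        while lab is not None:
            closed.add(lab)
            lab = parent.get(lab)  # walk up until the root (parent None)
    return closed

def hierarchical_pr(pred, true, parent):
    """Hierarchical precision/recall over ancestor-augmented label sets."""
    P = ancestor_closure(pred, parent)
    T = ancestor_closure(true, parent)
    inter = len(P & T)
    return inter / len(P), inter / len(T)

# Toy 3-level hierarchy (hypothetical): shiba_inu -> dog -> animal
parent = {"shiba_inu": "dog", "dog": "animal", "animal": None}
hp, hr = hierarchical_pr({"dog"}, {"shiba_inu"}, parent)
# Predicting 'dog' for a Shiba Inu image is hierarchically consistent
# (full precision) but misses the finest level (partial recall).
```

This illustrates why hierarchical metrics reward coarse-but-correct predictions that flat multi-label accuracy would score as outright errors.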
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 13346