Keywords: Large vision-language model, image tagging
Abstract: Image tagging, a fundamental vision task, traditionally relies on human-annotated datasets to train multi-label classifiers, which incurs significant labor and costs, especially for large-scale label spaces. While Large Vision-Language Models (LVLMs) offer promising potential to automate annotation, their capability to replace human annotators remains underexplored.
This paper systematically investigates LVLMs' annotation quality and the performance of models trained on LVLM-generated labels. Our analysis reveals that LVLMs achieve competitive performance on common categories but lag behind humans on uncommon or ambiguous categories. Surprisingly, models trained on LVLM-generated labels outperform those trained on human-annotated labels in certain categories, suggesting imperfections in human annotations.
Motivated by these findings, we propose \textsc{LVLMAnt}, a novel framework for image tagging that aims to achieve human-level annotation quality. \textsc{LVLMAnt} comprises two components: Prompts-to-Candidates (P2C), which employs group-wise prompting and annotation ensembling to efficiently produce a candidate set that covers as many true labels as possible while reducing the subsequent annotation workload; and Concept-Aligned Disambiguation (CAD), which interactively calibrates the semantic concepts of categories in the prompts and effectively refines the candidate labels.
Extensive experiments on benchmark datasets demonstrate \textsc{LVLMAnt}’s effectiveness in balancing annotation quality and automation, significantly reducing reliance on manual effort while achieving performance comparable to human annotations.
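The abstract's P2C stage (group-wise prompting followed by annotation ensembling) can be illustrated with a minimal sketch. The function names (`query_lvlm`, `prompts_to_candidates`) and parameters (`group_size`, `n_runs`, `min_votes`) are hypothetical illustrations, not the paper's actual interface; the paper's own grouping and ensembling details are not given in the abstract.

```python
# Hypothetical sketch of a P2C-style stage: split a large label space
# into groups, query an LVLM once per group, and ensemble several
# annotation runs into a candidate label set.

from collections import Counter

def query_lvlm(image, label_group):
    """Placeholder: return the subset of `label_group` the LVLM
    believes is present in `image`. Swap in a real model call."""
    raise NotImplementedError

def prompts_to_candidates(image, label_space, group_size=20,
                          n_runs=3, min_votes=2, lvlm=query_lvlm):
    # Group-wise prompting: partition the label space so each
    # prompt stays small even for large label vocabularies.
    groups = [label_space[i:i + group_size]
              for i in range(0, len(label_space), group_size)]
    votes = Counter()
    for _ in range(n_runs):          # annotation ensembling
        for group in groups:
            votes.update(lvlm(image, group))
    # Keep labels predicted in at least `min_votes` of the runs,
    # trading recall against candidate-set size.
    return {label for label, v in votes.items() if v >= min_votes}
```

A subsequent CAD-style step would then refine this candidate set by clarifying ambiguous category concepts in follow-up prompts.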
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 16538