Hierarchical Long-tailed Classification with Visual Language Models

18 Sept 2023 (modified: 25 Mar 2024), ICLR 2024 Conference Withdrawn Submission
Keywords: Long-Tailed Recognition, Visual Language Models, Large Language Models
Abstract: Vision Language Models (VLMs) have shown promising capabilities on open-vocabulary tasks but struggle when fine-tuned on imbalanced data, particularly under highly skewed label distributions. To address these challenges, we propose a hierarchical long-tailed classification framework, named HLC, which first prioritizes candidate categories with a coarse classifier and then performs fine-grained classification using detailed textual descriptions. Specifically, for the coarse stage we fine-tune a linear classifier on top of the CLIP encoder, incorporating visual prompt tokens and leveraging mixup in the shared feature space for multimodal feature interactions. Based on the candidates given by the coarse classifier, we query large language models to generate corresponding fine-grained descriptions, which are used to refine the final predictions. Importantly, we introduce a reweighting mechanism to filter out invalid descriptions generated by the language models. Extensive evaluations demonstrate that our approach achieves state-of-the-art performance on the Places-LT, ImageNet-LT, and iNaturalist 2018 datasets while fine-tuning only a few parameters.
Supplementary Material: zip
Primary Area: representation learning for computer vision, audio, language, and other modalities
Submission Number: 1337
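
The coarse-to-fine procedure outlined in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how candidates are selected, how descriptions are scored, or how the reweighting works. The sketch assumes top-k selection on the coarse logits, cosine similarity between an image embedding and per-class description embeddings, and a per-description non-negative weight where zero means the description is filtered out. The function name `coarse_to_fine_predict` and all of its parameters are hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def coarse_to_fine_predict(coarse_logits, image_emb, desc_embs, desc_weights, k=3):
    """Hypothetical two-stage prediction (assumed design, not from the paper).

    coarse_logits: (C,) scores from a coarse classifier.
    image_emb:     (D,) image embedding in a shared image-text space.
    desc_embs:     dict class index -> (M_c, D) description embeddings.
    desc_weights:  dict class index -> (M_c,) non-negative weights;
                   a weight of 0 removes that description from the score.
    """
    # Stage 1: keep the k highest-scoring classes as candidates.
    candidates = np.argsort(coarse_logits)[::-1][:k].tolist()

    # Stage 2: re-rank candidates by weighted mean cosine similarity
    # between the image and each candidate's descriptions.
    img = l2_normalize(np.asarray(image_emb, dtype=float))
    fine_scores = {}
    for c in candidates:
        sims = l2_normalize(np.asarray(desc_embs[c], dtype=float)) @ img
        w = np.asarray(desc_weights[c], dtype=float)
        total = w.sum()
        # A class whose descriptions were all filtered out cannot win.
        fine_scores[c] = float((w * sims).sum() / total) if total > 0 else float("-inf")

    best = max(fine_scores, key=fine_scores.get)
    return best, candidates, fine_scores
```

In this sketch the coarse stage only restricts which classes are compared; the final decision comes entirely from the description scores. Whether the paper combines the two stages differently cannot be determined from the abstract.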