Everyone, since 13 Oct 2023
Vision-language models (VLMs) have shown promising capabilities on open-vocabulary tasks but struggle when tuned on imbalanced data, particularly under highly skewed label distributions. To address these challenges, we propose a hierarchical long-tailed classification framework, named HLC, which first prioritizes candidate categories and then performs fine-grained classification using detailed textual descriptions. Specifically, we fine-tune a linear classifier on top of the CLIP encoder, incorporating visual prompt tokens and leveraging shared-feature-space mixup for multimodal feature interactions. Based on the candidates given by this coarse classifier, we query large language models to generate corresponding fine-grained descriptions, which are used to refine the final predictions. Importantly, we introduce a reweighting mechanism to filter out invalid descriptions generated by the language models. Extensive evaluations demonstrate that our approach achieves state-of-the-art performance on the Places-LT, ImageNet-LT, and iNaturalist 2018 datasets while fine-tuning only a few parameters.
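The coarse-to-fine inference described above can be illustrated with a minimal sketch. It is not the HLC implementation: the abstract does not specify the reweighting rule, so the softmax weighting over description similarities below is an assumed stand-in, and the function name `hierarchical_predict`, the parameters `k` and `temperature`, and the toy embeddings are all hypothetical. In practice the image and description features would come from the CLIP encoders; here they are small hand-written unit vectors.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()


def hierarchical_predict(coarse_logits, image_feat, desc_feats, k=2, temperature=0.1):
    """Two-stage prediction sketch (hypothetical helper, not the authors' code).

    coarse_logits: (C,) scores from the coarse linear classifier.
    image_feat:    (D,) L2-normalized image embedding.
    desc_feats:    dict mapping class index -> (M_c, D) array of L2-normalized
                   embeddings of the generated descriptions for that class.
    """
    # Stage 1: keep only the top-k candidate classes from the coarse classifier.
    candidates = np.argsort(coarse_logits)[::-1][:k].tolist()

    scores = {}
    for c in candidates:
        # Cosine similarity between the image and each description of class c.
        sims = desc_feats[c] @ image_feat
        # Assumed reweighting: descriptions that match the image poorly get
        # near-zero weight, so an invalid description barely affects the score.
        weights = softmax(sims / temperature)
        scores[c] = float(weights @ sims)

    # Stage 2: the final prediction is the candidate with the best refined score.
    best = max(scores, key=scores.get)
    return best, candidates, scores


# Toy example with 3 classes and 2-D unit-norm features (made-up values).
coarse_logits = np.array([2.0, 1.9, -1.0])
image_feat = np.array([1.0, 0.0])
desc_feats = {
    0: np.array([[0.0, 1.0], [0.6, 0.8]]),
    1: np.array([[1.0, 0.0], [0.0, 1.0]]),  # second description is off-target
    2: np.array([[1.0, 0.0]]),
}
best, candidates, scores = hierarchical_predict(coarse_logits, image_feat, desc_feats)
print(best, candidates, scores)
```

In this toy case the coarse classifier slightly prefers class 0, but the description-based refinement selects class 1, and class 2 is never scored because it falls outside the candidate set. A hard threshold on the similarities would be another plausible way to discard invalid descriptions; the soft weighting is chosen here only to keep the sketch short.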