Keywords: zero-shot, CLIP, classification, waffleclip, chils, cupl, scale, accessible, low compute, training-free, automated, semantics
TL;DR: We introduce DefNTaxS, which scalably leverages LLM-generated class taxonomies to augment CLIP text inputs, yielding up to 12.9% (5.5% average) accuracy gains across seven benchmarks. DefNTaxS is entirely zero-shot and requires no direct intervention.
Abstract: To successfully use generalized vision-language models (VLMs) like CLIP for zero-shot image classification, the semantics of the target classes must be well defined and easily differentiated. However, test datasets rarely meet either criterion, implicitly encoding ambiguity in class labels, even when adding individual descriptors. Existing literature focuses on improving text inputs by using class-specific descriptors to further refine taxonomic granularity, but largely fails to leverage higher-order semantic relationships among classes. We introduce Defined Taxonomic Stratification (DefNTaxS): a fully automated, procedural, training-free framework that leverages large language models (LLMs) to cluster related classes into hierarchical subcategories and augment CLIP prompts with this taxonomic context. By sculpting text prompts to boost both semantic content and inter-class differentiability, DefNTaxS disambiguates semantically similar classes and improves classification accuracy. Across seven standard benchmarks, including ImageNet, CUB, and Food101, DefNTaxS achieves up to +12.9% absolute accuracy gain (average +5.5%) over vanilla ViT-B/32 CLIP and consistent improvement over other recent SOTA, all while enhancing semantic interpretability without any model retraining/modification, manual prompt alteration, or additional optimization data.
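To make the prompt-augmentation idea concrete, below is a minimal sketch of taxonomy-augmented zero-shot classification with the public ViT-B/32 CLIP model. The taxonomy dictionary, prompt template, and file name are illustrative assumptions, not the paper's exact prompts or pipeline.

```python
# Illustrative sketch only: the taxonomy, template, and image path are assumptions.
import torch
import clip  # https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical LLM-generated taxonomy: class -> parent subcategory.
taxonomy = {
    "sparrow": "small perching bird",
    "finch": "small perching bird",
    "hawk": "bird of prey",
    "eagle": "bird of prey",
}
classes = list(taxonomy.keys())

# Augment the standard CLIP template with taxonomic context for each class.
prompts = [f"a photo of a {c}, a type of {taxonomy[c]}." for c in classes]

with torch.no_grad():
    text_feats = model.encode_text(clip.tokenize(prompts).to(device))
    text_feats /= text_feats.norm(dim=-1, keepdim=True)

    image = preprocess(Image.open("bird.jpg")).unsqueeze(0).to(device)
    image_feats = model.encode_image(image)
    image_feats /= image_feats.norm(dim=-1, keepdim=True)

    # Cosine similarity between the image and each taxonomy-augmented class prompt.
    probs = (100.0 * image_feats @ text_feats.T).softmax(dim=-1)

print(classes[probs.argmax().item()])
```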
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 24481