Keywords: few-shot learning, interpretability, large vision-language models, meta-task instruction framework, learnable category embeddings
TL;DR: This paper introduces ALERT, a few-shot learning approach that improves both the interpretability and accuracy of large vision-language models via learnable category embeddings and GRPO training with a contrastive reward.
Abstract: In this paper, we introduce ALERT, a novel approach to few-shot learning that significantly enhances both the interpretability and accuracy of large vision-language models (LVLMs) in classification tasks with limited data. By leveraging the strengths of LVLMs and integrating a meta-task instruction framework, ALERT transforms the traditionally black-box nature of few-shot models into a transparent process, allowing for traceable and understandable reasoning. ALERT employs learnable category embeddings to emphasize the unique features of each category, improving classification accuracy, and introduces a contrastive reward function within a Group Relative Policy Optimization (GRPO) training framework to enhance reasoning capabilities and training stability. Extensive experiments across various datasets demonstrate that ALERT consistently outperforms existing few-shot learning methods, achieving state-of-the-art results. Notably, in the 16-shot setting on ImageNet, ALERT achieves an accuracy of 78.74%, a significant improvement over previous methods.
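The two mechanisms named in the abstract can be illustrated with a minimal sketch. Everything here is an assumption for illustration only: the embedding dimensions, the exact contrastive form (target-class similarity minus the best other-class similarity), and the group-normalized advantage are generic stand-ins, not the paper's actual formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical learnable category embeddings (sizes are illustrative):
# one unit vector per class, trained to highlight category-distinctive features.
num_classes, dim = 4, 8
category_emb = rng.normal(size=(num_classes, dim))
category_emb /= np.linalg.norm(category_emb, axis=1, keepdims=True)

def contrastive_reward(response_emb, target):
    """Generic contrastive reward: cosine similarity to the target class
    minus the highest similarity to any other class."""
    sims = category_emb @ (response_emb / np.linalg.norm(response_emb))
    others = np.delete(sims, target)
    return sims[target] - others.max()

def grpo_advantages(rewards):
    """Group-relative advantages as used in GRPO-style training:
    normalize each reward against the group's mean and std."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# A group of sampled response embeddings for one image whose true class is 2.
group = rng.normal(size=(5, dim))
rewards = [contrastive_reward(g, target=2) for g in group]
advantages = grpo_advantages(rewards)
print(np.round(advantages, 3))
```

The group normalization is what makes the objective "relative": responses are scored against their siblings for the same input rather than against an absolute baseline, which is one reason GRPO-style training tends to be stable.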
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 8348