Abstract: Visual-language models based on CLIP have shown remarkable abilities in general few-shot image classification. However, their performance drops in specialized fields such as healthcare or agriculture, because CLIP's pre-training data does not cover such categories. Existing methods depend excessively on the multi-modal representation and alignment capabilities acquired during CLIP pre-training, which hinders accurate generalization to unfamiliar domains. To address this issue, this paper introduces a novel visual-language collaborative representation network (MCRNet), which aims to acquire a generalizable capability for the collaborative fusion and representation of multi-modal information. Specifically, MCRNet learns to generate relational matrices from an information-fusion perspective to obtain aligned multi-modal features. This relation-generation strategy is category-agnostic and therefore generalizes to new domains. A class-adaptive fine-tuning inference technique is also introduced to help MCRNet efficiently learn alignment knowledge for new categories from limited data. Additionally, the paper establishes a new broad-domain few-shot image classification benchmark comprising seven evaluation datasets from five domains. Comparative experiments demonstrate that MCRNet outperforms current state-of-the-art models, achieving average improvements of 13.06% and 13.73% in the 1-shot and 5-shot settings, respectively, highlighting its superior performance and applicability across various domains.
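To make the relational-matrix idea in the abstract more concrete, below is a minimal PyTorch sketch of a category-agnostic fusion block: it builds a relation matrix between visual and text tokens from feature similarities alone (no class-specific parameters) and uses it to route text information onto the visual features. This is not the authors' implementation; all names (RelationFusion, vis_feat, txt_feat, d_model) and design choices are illustrative assumptions, and MCRNet's actual architecture is defined in the paper itself.

```python
# Hedged sketch only: a relation-matrix fusion block in the spirit described
# by the abstract, not MCRNet's actual code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationFusion(nn.Module):
    """Generate a relation matrix between visual and text tokens and use it to
    produce aligned multi-modal features. No class-specific parameters are
    involved, so the same block can be applied to unseen categories."""
    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)  # project visual tokens
        self.k_proj = nn.Linear(d_model, d_model)  # project text tokens
        self.scale = d_model ** -0.5

    def forward(self, vis_feat: torch.Tensor, txt_feat: torch.Tensor):
        # vis_feat: (B, N_v, d), txt_feat: (B, N_t, d)
        q = self.q_proj(vis_feat)
        k = self.k_proj(txt_feat)
        # Relation matrix over text tokens for each visual token: (B, N_v, N_t)
        relation = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        aligned = relation @ txt_feat            # text information routed to visual tokens
        fused = F.normalize(vis_feat + aligned, dim=-1)
        return fused, relation

# Toy usage: fuse 5 support-image embeddings with 10 candidate text embeddings.
fusion = RelationFusion(d_model=512)
vis = torch.randn(1, 5, 512)   # e.g., CLIP image embeddings of support samples
txt = torch.randn(1, 10, 512)  # e.g., CLIP text embeddings of class prompts
fused, rel = fusion(vis, txt)
print(fused.shape, rel.shape)  # torch.Size([1, 5, 512]) torch.Size([1, 5, 10])
```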
Primary Subject Area: [Content] Multimodal Fusion
Secondary Subject Area: [Content] Vision and Language, [Content] Media Interpretation
Relevance To Conference: To address the issue of existing CLIP-based visual-language models performing poorly in specific domains such as healthcare or industry, this paper proposes a new method that synergizes the alignment and representation processes of visual and text information, named the Visual-Language Collaborative Representation Network. We believe this collaborative approach to multimodal information alignment and interaction can be extended to the broader fields of multimodal fusion and understanding. Additionally, this paper constructs a new image classification benchmark covering five domains, which provides a comprehensive evaluation platform for applying multimodal methods to domain-specific visual tasks. Therefore, we believe this paper aligns with the objectives and scope of the conference.
Supplementary Material: zip
Submission Number: 1441