MALIP: Improving Few-Shot Image Classification with Multimodal Fusion Enhancement

Published: 01 Jan 2024, Last Modified: 13 Nov 2024 · ICME 2024 · CC BY-SA 4.0
Abstract: With the significant progress of pre-trained vision-language models such as CLIP, recent CLIP-based methods have shown impressive performance on few-shot tasks. However, CLIP-based representations exhibit a natural gap in downstream few-shot tasks because limited data provides scarce label-related multimodal information. This raises the question of whether a generative model trained on the downstream task could be used to enhance label-related multimodal fusion. In this paper, we propose a generative model-based multimodal fusion enhancement method, MALIP, which improves the few-shot performance of CLIP via a Multimodal Adapter module. Specifically, we first leverage a variational autoencoder (VAE), which can be trained in the few-shot setting, to expand the data. We then construct adapter weights via a key-value cache model built from the image and text features of the expanded data. Finally, through extensive experiments on 11 datasets, we demonstrate that MALIP achieves state-of-the-art few-shot image classification performance.
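To make the described pipeline concrete, the sketch below shows one way the key-value cache adapter over CLIP features might look in PyTorch. This is only an illustration under assumptions: the function names (`build_cache`, `malip_logits`), the Tip-Adapter-style exponential affinity, and the hyperparameters `alpha`/`beta` are not specified in the abstract, and the support features are simply assumed to include the VAE-expanded samples.

```python
# Minimal, assumption-laden sketch of a key-value cache adapter over CLIP features.
# Keys: image features of the (real + VAE-generated) support set.
# Values: one-hot labels of those support samples.
import torch
import torch.nn.functional as F

def build_cache(support_feats, support_labels, num_classes):
    """Build cache keys/values from support image features and labels."""
    keys = F.normalize(support_feats, dim=-1)                 # (N, D)
    values = F.one_hot(support_labels, num_classes).float()   # (N, C)
    return keys, values

def malip_logits(image_feats, text_weights, keys, values, alpha=1.0, beta=5.5):
    """Fuse zero-shot CLIP logits with cache-model logits (illustrative fusion rule)."""
    image_feats = F.normalize(image_feats, dim=-1)            # (B, D)
    clip_logits = 100.0 * image_feats @ text_weights          # (B, C); text_weights: (D, C)
    affinity = image_feats @ keys.t()                         # (B, N) cosine similarities
    cache_logits = torch.exp(-beta * (1.0 - affinity)) @ values  # (B, C)
    return clip_logits + alpha * cache_logits
```

In this sketch, `alpha` balances the cache-model logits against the zero-shot CLIP logits and `beta` sharpens the affinity; the actual MALIP adapter and its multimodal fusion may differ from this simplified form.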