Multi-modal Adapter for Medical Vision-and-Language Learning

Published: 01 Jan 2023, Last Modified: 29 Oct 2023, MLMI@MICCAI (1) 2023
Abstract: Recently, medical vision-and-language learning has attracted great attention from the biomedical community. Thanks to the development of large pre-trained models, performance on medical multi-modal learning benchmarks has improved substantially. However, due to the rapid growth of model size, fully fine-tuning these large pre-trained models has become costly, as a huge number of parameters must be trained and stored for each downstream task. We therefore propose a parameter-efficient transfer learning method named Medical Multi-Modal Adapter (M$^3$AD) to mitigate this problem. We select the state-of-the-art M$^3$AE model as our baseline, which is pre-trained on 30k medical image-text pairs with multiple proxy tasks and has about 340M parameters. Specifically, we first insert general adapters after the multi-head attention layers and feed-forward layers in all transformer blocks of M$^3$AE. Then, we design a modality-fusion adapter that adopts a multi-head attention mechanism and insert it into the cross-modal encoder to enhance multi-modal interactions. In contrast to full fine-tuning, we freeze most parameters of M$^3$AE and train only the inserted adapters, which are much smaller in size. Extensive experimental results on three medical visual question answering datasets and one medical multi-modal classification dataset demonstrate the effectiveness of the proposed method: M$^3$AD achieves performance competitive with full fine-tuning while requiring far fewer trainable parameters and less memory.
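To make the adapter scheme described in the abstract concrete, below is a minimal PyTorch sketch of a bottleneck adapter, a multi-head-attention-based modality-fusion adapter, and the freezing of the backbone. The class names (`Adapter`, `FusionAdapter`, `freeze_backbone_except_adapters`) and the hidden/bottleneck sizes are illustrative assumptions, not the authors' released implementation of M$^3$AD.

```python
# Hypothetical sketch of the adapter scheme: bottleneck adapters after attention/FFN
# sub-layers, plus a cross-attention "modality-fusion" adapter. Dimensions are guesses.
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Bottleneck adapter: down-project, non-linearity, up-project, residual add."""

    def __init__(self, hidden_dim: int = 768, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the frozen backbone's features intact at init.
        return x + self.up(self.act(self.down(x)))


class FusionAdapter(nn.Module):
    """Modality-fusion adapter: cross-attention over the other modality, then a bottleneck."""

    def __init__(self, hidden_dim: int = 768, num_heads: int = 8, bottleneck_dim: int = 64):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.bottleneck = Adapter(hidden_dim, bottleneck_dim)

    def forward(self, x: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
        # Queries come from one modality, keys/values from the other.
        attended, _ = self.cross_attn(query=x, key=other, value=other)
        return self.bottleneck(x + attended)


def freeze_backbone_except_adapters(model: nn.Module) -> None:
    """Train only adapter parameters; assumes adapter modules are registered
    under attribute names containing 'adapter'."""
    for name, param in model.named_parameters():
        param.requires_grad = "adapter" in name.lower()


# Toy usage: fuse text tokens with image patch features of matching hidden size.
text_feats = torch.randn(2, 32, 768)
image_feats = torch.randn(2, 49, 768)
fused = FusionAdapter()(text_feats, image_feats)  # shape (2, 32, 768)
```

The residual bottleneck design keeps the number of trainable parameters per insertion small (two thin linear layers), which is what allows most of the ~340M backbone parameters to stay frozen during downstream fine-tuning.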