Abstract: With the advent of the foundation-model era, pre-training followed by fine-tuning has become the common paradigm. Recently, parameter-efficient fine-tuning (PEFT) has attracted widespread attention for its favorable balance between the number of learnable parameters and performance. However, existing PEFT methods focus primarily on single-modal modeling, which limits their ability to leverage cross-modal complementary information and weakens their generalization. Moreover, these methods often fail to exploit structural knowledge in downstream tasks, a limitation that is particularly pronounced in scenarios requiring cross-modal interaction and hierarchical relationship modeling. To address these issues, this paper proposes GA-Net, a graph-network-based multi-modal parameter-efficient fine-tuning method. Each image is fed into a multi-modal large language model (MLLM) to generate a text description. The image and its corresponding description are then encoded separately by frozen image and text encoders to extract visual and textual features, which are fused through a cross-attention mechanism into multi-modal features. A graph is constructed from the cosine similarity between multi-modal feature nodes, and knowledge and associations among features are mined at each node of the graph. Furthermore, we introduce a multi-modal Elastic Weight Consolidation (EWC) regularization term into the loss function to mitigate catastrophic forgetting during task learning. Experimental results demonstrate that GA-Net significantly outperforms state-of-the-art (SOTA) methods, improving test accuracy by 5.32%, 3.05%, and 1.09% on the OxfordPets, Flowers102, and Food101 datasets, respectively.
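The two core mechanisms named in the abstract, building a graph from the cosine similarity of fused feature nodes and adding an EWC regularization term, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the similarity threshold, the use of NumPy, and the standard quadratic EWC form are all assumptions made for clarity.

```python
import numpy as np

def build_similarity_graph(features: np.ndarray, threshold: float = 0.5):
    """Build an adjacency matrix from cosine similarity between feature nodes.

    features: (N, D) array, one fused multi-modal feature vector per node.
    Returns (sim, adj): the (N, N) cosine-similarity matrix and a binary
    adjacency matrix connecting nodes whose similarity exceeds the threshold.
    (The threshold value here is an assumption, not from the paper.)
    """
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    normed = features / np.clip(norms, 1e-12, None)  # L2-normalize rows
    sim = normed @ normed.T                          # dot products = cosine sims
    adj = (sim >= threshold).astype(float)
    np.fill_diagonal(adj, 0.0)                       # drop self-loops
    return sim, adj

def ewc_penalty(params: np.ndarray, old_params: np.ndarray,
                fisher: np.ndarray, lam: float = 1.0) -> float:
    """Standard EWC regularizer: (lam/2) * sum_i F_i * (theta_i - theta*_i)^2,
    penalizing drift of parameters that were important (high Fisher value)
    for a previous task. The paper's multi-modal variant may differ."""
    return 0.5 * lam * float(np.sum(fisher * (params - old_params) ** 2))
```

For example, three nodes where the first two share a direction and the third is orthogonal yield an edge only between the first two; the EWC term then grows quadratically as parameters move away from their previous-task values.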
External IDs: dblp:journals/apin/ChengL25