DVT-LLaVA: Vision-Language Model Personalization with Disentangled Visual Tuning

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Vision-language model, Personalization
TL;DR: A novel framework for VLM personalization that enhances visual learning of the user's concepts via disentangled visual tuning
Abstract: Personalizing foundation vision-language models (VLMs) for individual users could enhance the user experience when interacting with VLMs. Most existing methods introduce additional trainable tokens and finetune the VLM to fit each user's data. Despite demonstrated improvements on some Visual Question Answering (VQA) benchmarks, we reveal that these gains come mostly from a shortcut: memorizing information from the introduced textual training data. The capability to visually understand the user's target concepts -- key to VQA tasks -- remains largely unimproved after finetuning. This is especially true for visual concepts residing in complex backgrounds, as these methods often learn representations in which concept-relevant and concept-irrelevant information is intertwined. To tackle these issues, we introduce DVT-LLaVA, which learns disentangled visual representations of target concepts by jointly learning concept-relevant and concept-irrelevant tokens on a crafted vision-text dataset derived from image captions. We further propose tuning the LayerNorm layers to enhance learning capacity, and we adopt a text-embedding augmentation strategy to mitigate overfitting on the training text-image pairs. In addition, we reveal that existing evaluation benchmarks in this field are based mainly on multiple-choice questions, which fail to accurately assess model performance in the open-set setting; to remedy this, we establish a new benchmark for this aspect. Extensive evaluations demonstrate the superiority and versatility of DVT-LLaVA.
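The tuning recipe sketched in the abstract (learnable concept-relevant and concept-irrelevant tokens combined with LayerNorm-only finetuning of the backbone) can be illustrated roughly as follows. This is a minimal sketch, not the paper's implementation: `ToyVLM`, `personalize`, and all shapes are illustrative placeholders standing in for a real VLM backbone.

```python
import torch
import torch.nn as nn


class ToyVLM(nn.Module):
    """Minimal stand-in for a VLM backbone (illustrative only)."""

    def __init__(self, dim: int = 32):
        super().__init__()
        self.proj = nn.Linear(dim, dim)   # placeholder for transformer weights
        self.ln = nn.LayerNorm(dim)       # the layer DVT-style tuning would train

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.ln(self.proj(x))


def personalize(model: nn.Module, n_relevant: int = 4,
                n_irrelevant: int = 4, dim: int = 32):
    """Freeze the backbone, unfreeze only LayerNorm parameters,
    and create learnable concept tokens (hypothetical setup)."""
    for p in model.parameters():
        p.requires_grad = False
    for m in model.modules():
        if isinstance(m, nn.LayerNorm):
            for p in m.parameters():
                p.requires_grad = True
    # Concept-relevant vs. concept-irrelevant soft tokens, learned jointly.
    relevant = nn.Parameter(torch.randn(n_relevant, dim) * 0.02)
    irrelevant = nn.Parameter(torch.randn(n_irrelevant, dim) * 0.02)
    return relevant, irrelevant


model = ToyVLM()
rel, irr = personalize(model)
# Only the LayerNorm parameters of the backbone remain trainable.
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
```

An optimizer for the personalization step would then be built over `trainable` plus the two token tensors, leaving the rest of the backbone untouched.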
Primary Area: generative models
Submission Number: 12210