Abstract: Despite significant advances in large language models (LLMs) and vision-language models (VLMs), research on role-playing (RP) in VLMs remains in its nascent stage, with a conspicuous lack of systematic evaluations of RP capabilities. This study addresses this gap by exploring how role-specific prompts influence VLM performance on image description tasks. We propose a comprehensive evaluation framework designed to assess the RP abilities of VLMs, encompassing classification accuracy, semantic similarity, lexical diversity, and the potential harmfulness of generated content. Our findings indicate that VLM performance improves significantly as the age of the assigned role increases: models portraying older roles produce descriptions that are semantically more accurate and contextually richer. Furthermore, introducing domain-specific roles markedly enhances performance, particularly when the role's expert knowledge aligns with task requirements. This study not only underscores the need for systematic assessment of RP capabilities in VLMs but also offers insights for developing multimodal systems that are contextually aware and morally responsible across diverse applications.
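The abstract names the framework's metrics only at a high level. As a rough illustration, the sketch below computes two of them, lexical diversity (here a simple type-token ratio) and semantic similarity (here cosine similarity of sentence embeddings via the sentence-transformers library). The metric definitions, the embedding model, and all function names are assumptions for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch of two metrics from the evaluation framework.
# The paper does not specify implementations; these are common proxies.
from sentence_transformers import SentenceTransformer, util  # assumed dependency

_model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model


def lexical_diversity(text: str) -> float:
    """Type-token ratio: unique words / total words (one common proxy)."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def semantic_similarity(description: str, reference: str) -> float:
    """Cosine similarity between the embeddings of a generated
    description and a reference description."""
    emb = _model.encode([description, reference], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()


if __name__ == "__main__":
    desc = "An elderly radiologist notes a subtle opacity in the left lung."
    ref = "A chest X-ray showing a small shadow in the left lung."
    print(f"lexical diversity:   {lexical_diversity(desc):.3f}")
    print(f"semantic similarity: {semantic_similarity(desc, ref):.3f}")
```

Scoring each role-conditioned description against a reference in this way would yield per-role metric curves, e.g. similarity as a function of the role's age or domain expertise.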
External IDs: dblp:conf/ijcnn/NiuCWCLL25