Visual-Oriented Fine-Grained Knowledge Editing for MultiModal Large Language Models

Zhen Zeng, Leijiang Gu, Xun Yang, Zhangling Duan, Zenglin Shi, Meng Wang

Published: 2025, Last Modified: 30 May 2026 · ICCV 2025 · CC BY-SA 4.0
Abstract: Existing knowledge editing works for MultiModal Large Language Models primarily focus on text-oriented, coarse-grained scenarios, where modifying textual content alone is sufficient. As a result, they fail to capture the unique challenges of multi-modal editing, particularly when visual information is central to knowledge representation. In this paper, we introduce a visual-oriented, fine-grained multi-modal knowledge editing task that targets precise modifications in images containing multiple interacting entities. To support this, we propose the Fine-Grained Visual Knowledge Editing (FGVEdit) benchmark, designed to evaluate the accuracy and effectiveness of multi-modal editing at a granular level. To address this challenge, we present the Multimodal Scope Classifier-based Knowledge Editor (MSCKE), a new framework that leverages a multi-modal scope classifier to integrate both textual and visual information. By accurately identifying and updating knowledge localized within images, MSCKE ensures precise editing while preserving unrelated content. Extensive experiments on the FGVEdit benchmark highlight the complexity of this new task and demonstrate that existing methods struggle with fine-grained multi-modal editing. Our results highlight MSCKE as a scalable and promising framework for advancing multi-modal knowledge editing. Code is available at https://github.com/zeng-zhen/FGVEdit.