Keywords: Multimodal Models, Compositional Image Retrieval, Attention Editing, Web-Scale Retrieval, Hierarchical Masking, Prompt Localization
TL;DR: This paper introduces an attention-editing framework for Compositional Image Retrieval that efficiently integrates web-scale knowledge through structured prompting and dynamic masking.
Abstract: This paper proposes a web-knowledge infusion method based on attention editing, aimed at improving how Compositional Image Retrieval (CIR) models comprehend and exploit complex web-scale knowledge. To address the limitations of conventional multimodal models in processing massive web knowledge, we construct a structured knowledge-enhanced dataset and build an attention-guided knowledge infusion framework on top of it. The method transmits web knowledge progressively, from coarse to fine granularity, through a carefully designed prompt localization system and a hierarchically controlled masking mechanism. Specifically, structured prompt templates encode web knowledge into learnable semantic units, while dynamic attention editing governs the injection process, enabling the model to adaptively filter and integrate heterogeneous, multi-source web knowledge. Experimental results demonstrate that this approach not only significantly improves the model's efficiency in capturing implicit web knowledge but also effectively mitigates knowledge conflicts and redundancy. Our work establishes a new technical paradigm for knowledge distillation and transfer in multimodal retrieval systems.
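The abstract's hierarchically controlled masking can be pictured as attention editing over a stack of boolean masks, where coarser levels expose fewer knowledge tokens and finer levels progressively reveal more. The sketch below is illustrative only: the function `edited_attention`, the mask layout, and the level semantics are assumptions of this example, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def edited_attention(q, k, v, mask_levels, level):
    """Single-head attention whose scores are edited by a mask hierarchy.

    q: (n_query, d) query vectors; k, v: (n_know, d), (n_know, d_v)
    knowledge-token keys/values. mask_levels: list of boolean arrays of
    shape (n_know,), ordered coarse to fine. Tokens not visible at or
    below `level` are blocked from attention (hypothetical scheme).
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    visible = np.zeros(k.shape[0], dtype=bool)
    for m in mask_levels[: level + 1]:   # union of masks up to this level
        visible |= m
    scores[:, ~visible] = -1e9           # edit: suppress hidden tokens
    return softmax(scores) @ v

# Usage: two coarse tokens visible at level 0, three more added at level 1.
rng = np.random.default_rng(0)
q = rng.normal(size=(2, 4))
k = rng.normal(size=(5, 4))
v = np.eye(5)  # one-hot values so outputs equal attention weights
levels = [np.array([True, True, False, False, False]),
          np.array([False, False, True, True, True])]
coarse_out = edited_attention(q, k, v, levels, level=0)
fine_out = edited_attention(q, k, v, levels, level=1)
```

With one-hot values, each output row is the attention distribution itself, so at level 0 all mass falls on the first two (coarse) knowledge tokens; raising the level redistributes mass over the finer tokens as well.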
Supplementary Material: zip
Primary Area: transfer learning, meta learning, and lifelong learning
Submission Number: 12431