Memory-Enhanced MLLM for Long-Context Understanding: Addressing Non-Semantic Retrieval

26 Sept 2024 (modified: 10 Oct 2024) · ICLR 2025 Conference Withdrawn Submission · CC BY 4.0
Keywords: multi-modal, MLLM, long-context, LLM, memory
Abstract: Recent studies have expanded the training data for multi-modal large language models (MLLMs), enabling them to handle tasks involving multiple images and videos. However, these models still exhibit suboptimal performance on tasks requiring longer contextual understanding due to limitations in model architecture and training data. Furthermore, differences in model structure often lead to incompatible approaches for handling long-context multi-modal tasks. In this paper, we propose a lightweight multi-modal memory component that enhances the long-context processing capabilities of existing multi-modal large models. The memory component is model-agnostic, allowing it to be applied across different multi-modal architectures. Specifically, we adopt a memory construction approach similar to RAG: the multi-modal input is divided into segments, each encoded into distinct features, and relevant memories are retrieved based on the current input to guide the final generation. During our research, we observed that purely semantic retrieval is often insufficient to provide all the information needed for multi-modal generation tasks. To address this, we introduce the concept of non-semantic retrieval, which encompasses retrieval tasks in long-context multi-modal inputs that cannot rely solely on semantic information. We compile a variety of common non-semantic retrieval scenarios and construct a corresponding dataset. Based on this, we design and train a model capable of performing both semantic and non-semantic retrieval. Our model leverages attention mechanisms to capture non-semantic information and employs a gating mechanism to balance semantic and non-semantic retrieval results, producing fused feature vectors. This keeps our retrieval model compatible with FAISS for high-speed retrieval. We evaluate our approach on three benchmarks focused on long-context multi-modal tasks, demonstrating the effectiveness of the memory module and non-semantic retrieval in improving the performance of multi-modal large models, particularly on the challenging tasks of cross-modal image-text interaction and long video understanding. To the best of our knowledge, this is the first work to explore multi-modal memory in large models. We hope our contributions will inspire further research on multi-modal large language models and multi-modal retrieval.
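
To make the gated fusion and FAISS compatibility described in the abstract concrete, here is a minimal sketch (not the authors' released code): it assumes per-segment semantic and non-semantic feature vectors of a fixed dimension, fuses them with a hypothetical learned sigmoid gate, and indexes the fused vectors in a standard FAISS inner-product index. All module names, dimensions, and the exact gating formula are illustrative assumptions.

```python
# Sketch only: gated fusion of semantic and non-semantic segment features,
# kept compatible with FAISS retrieval. Names and dimensions are assumptions.
import numpy as np
import torch
import torch.nn as nn
import faiss


class GatedFusionRetriever(nn.Module):
    """Fuse semantic and non-semantic features into one vector per segment."""

    def __init__(self, dim: int = 512):
        super().__init__()
        # The gate decides, per dimension, how much non-semantic signal to mix in.
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, sem: torch.Tensor, non_sem: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(torch.cat([sem, non_sem], dim=-1)))
        fused = g * sem + (1.0 - g) * non_sem
        # L2-normalize so inner-product search behaves like cosine similarity.
        return nn.functional.normalize(fused, dim=-1)


# Toy usage: index fused memory segments and retrieve candidates for a query.
dim, n_segments = 512, 100
model = GatedFusionRetriever(dim)
sem_feats = torch.randn(n_segments, dim)      # e.g. encoder-derived semantic features
non_sem_feats = torch.randn(n_segments, dim)  # e.g. attention-derived non-semantic cues

with torch.no_grad():
    fused = model(sem_feats, non_sem_feats).numpy().astype("float32")

index = faiss.IndexFlatIP(dim)   # flat inner-product index over fused vectors
index.add(fused)

query = fused[:1]                # stand-in for the current input's fused feature
scores, ids = index.search(query, k=5)  # top-5 memory segments to guide generation
print(ids[0])
```

Because the fusion produces a single fixed-size vector per segment, any FAISS index type (flat, IVF, HNSW) can be swapped in without changing the retrieval interface.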
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6992