SCMF: Lightweight Retrieval-Augmented Generation via Retrieval Vector Compression

Jiaquan Zhang; Chaoning Zhang; Qigan Sun; Yibei Liu; Xudong Wang; Pengcheng Zheng; Chenghao Li; Sihan Cao; Caiyan Qin; Jiwei Wei; Yang Yang

20 Sept 2025 (modified: 15 Nov 2025), ICLR 2026 Conference Withdrawn Submission, CC BY 4.0

Keywords: Retrieval-Augmented Generation, Semantic Memory, Vector Quantization, Residual Vector Quantization, PCA Compression, Efficient Retrieval

TL;DR: We propose SCMF, a semantic compressed memory framework that accelerates retrieval in RAG by PCA+RVQ compression while preserving knowledge traceability.

Abstract: As Retrieval-Augmented Generation (RAG) is widely adopted for knowledge-intensive tasks, its efficiency bottlenecks have become increasingly evident: storing and retrieving large-scale, high-dimensional embeddings incurs substantial storage and computation costs. To address this challenge, we propose the Semantic Compressed Memory Framework (SCMF), a lightweight and traceable indexing paradigm tailored for large-scale RAG. SCMF first projects document embeddings into a low-dimensional semantic space and then discretizes them into compact Semantic Memory Units (SMUs) via Residual Vector Quantization (RVQ). Each SMU is explicitly linked to its corresponding Raw Knowledge Unit (RKU) through a semantic inverted index, which enables efficient CRUD operations while preserving the traceability of retrieval results. During retrieval, SCMF performs Approximate Nearest Neighbor (ANN) search in the SMU space, followed by a two-stage re-ranking strategy that combines sparse retrieval (BM25) with dense retrieval to localize evidence efficiently and accurately. Experimental results show that SCMF substantially reduces storage costs and retrieval latency while preserving explicit traceability to the original knowledge units, and that it significantly outperforms mainstream vector indexing methods.
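
The pipeline described in the abstract (projection, RVQ discretization, inverted index, search in the compressed space) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (fit_pca, fit_rvq, rvq_encode, rvq_decode, search), the toy k-means, the random data, and the parameters (32-dimensional embeddings, 8 PCA dimensions, 3 RVQ levels, 16 centroids per level) are all assumptions chosen for illustration. The search is an exhaustive scan over reconstructed codes standing in for ANN search, and the BM25 and dense re-ranking stages are omitted.

```python
import numpy as np
from collections import defaultdict

def fit_pca(X, d):
    """Return the data mean and the top-d principal directions (rows)."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:d]

def nearest(X, centroids):
    """Index of the closest centroid (squared Euclidean) for each row of X."""
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans(X, k, iters=10, seed=0):
    """Toy Lloyd's k-means; empty clusters keep their previous centroid."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        assign = nearest(X, centroids)
        for j in range(k):
            members = X[assign == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids

def fit_rvq(Z, levels, k):
    """Train one codebook per level, each on the residual left by earlier levels."""
    codebooks, residual = [], Z.copy()
    for level in range(levels):
        cb = kmeans(residual, k, seed=level)
        residual = residual - cb[nearest(residual, cb)]
        codebooks.append(cb)
    return codebooks

def rvq_encode(Z, codebooks):
    """Greedy encoding: one uint8 code per level (requires k <= 256)."""
    residual = Z.copy()
    codes = np.empty((len(Z), len(codebooks)), dtype=np.uint8)
    for level, cb in enumerate(codebooks):
        codes[:, level] = nearest(residual, cb)
        residual = residual - cb[codes[:, level]]
    return codes

def rvq_decode(codes, codebooks):
    """Reconstruct vectors by summing the selected centroid from every level."""
    return sum(cb[codes[:, level]] for level, cb in enumerate(codebooks))

# Hypothetical toy corpus: 200 random "document embeddings" of dimension 32.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 32))

mean, components = fit_pca(X, d=8)
Z = (X - mean) @ components.T           # low-dimensional projection
codebooks = fit_rvq(Z, levels=3, k=16)
codes = rvq_encode(Z, codebooks)        # compact codes, 3 bytes per document

# Inverted index: code tuple (stand-in for an SMU) -> document ids (stand-in for RKUs).
inverted_index = defaultdict(list)
for doc_id, code in enumerate(codes):
    inverted_index[tuple(int(c) for c in code)].append(doc_id)

def search(query, top_k=5):
    """Rank distinct codes by reconstruction distance, then expand to document ids."""
    q = (query - mean) @ components.T
    keys = list(inverted_index)
    recon = rvq_decode(np.array(keys), codebooks)
    order = np.argsort(((recon - q) ** 2).sum(axis=1))
    hits = []
    for i in order:
        hits.extend(inverted_index[keys[i]])
        if len(hits) >= top_k:
            break
    return hits[:top_k]
```

In this toy setup each document is stored as three one-byte codes plus shared codebooks, rather than 32 floating-point values, and every retrieved code maps back to concrete document ids through inverted_index. Calling search(X[5], top_k=5) returns five candidate document ids that a re-ranking stage could then refine.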

Primary Area: foundation or frontier models, including LLMs

Submission Number: 24811
