VRSD: Rethinking Similarity and Diversity for Retrieval in Large Language Models

Hang Gao; Yongfeng Zhang

25 Sept 2024 (modified: 11 Dec 2024) | ICLR 2025 Conference Withdrawn Submission | CC BY 4.0

Keywords: Algorithms, Large Language Model, NP-complete, Vector Retrieval

TL;DR: We prove that achieving a fair balance between similarity and diversity in vector retrieval is an NP-complete problem and introduce a heuristic algorithm that efficiently selects high-quality contextual examples for large language models.

Abstract: Vector retrieval algorithms are essential for semantic queries within the rapidly evolving landscape of Large Language Models (LLMs). The ability to retrieve vectors that satisfy both similarity and diversity criteria substantially enhances the performance of LLMs. Although Maximal Marginal Relevance (MMR) is widely employed in retrieval scenarios requiring relevance and diversity, variations in the parameter \( \lambda \) lead to fluctuations that complicate the optimization trajectory in vector spaces. This obscures the direction of improvement and highlights the lack of a robust theoretical analysis regarding similarity and diversity constraints in retrieval processes. To address these challenges, this paper introduces a novel approach that characterizes both constraints through the relationship between the sum vector and the query vector. The proximity of these vectors ensures the similarity constraint, while requiring individual vectors within the sum vector to diverge in their alignment with the query vector satisfies the diversity constraint. We first formulate a new combinatorial optimization problem, selecting \( k \) vectors from a candidate set such that their sum vector maximally aligns with the query vector, and demonstrate that this problem is NP-complete. This result underscores the inherent difficulty of simultaneously achieving similarity and diversity in vector retrieval, thereby providing a theoretical foundation for future research. Subsequently, we present the heuristic algorithm Vectors Retrieval with Similarity and Diversity (VRSD), which features a clear optimization objective and eliminates the need for preset parameters. VRSD also achieves a modest reduction in time complexity compared to MMR. Empirical validation confirms that VRSD significantly outperforms MMR across various datasets, while also demonstrating that the sum vector effectively captures both diversity and similarity simultaneously. The data and code are available at https://anonymous.4open.science/r/VRSD-CF9D.
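
The sum-vector objective described in the abstract can be illustrated with a small sketch. The abstract does not spell out the steps of VRSD, so the greedy rule below (at each step, add the candidate that makes the running sum vector most cosine-aligned with the query) is an assumption chosen for illustration, not the authors' implementation. The names `cosine` and `greedy_sum_vector_select` are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def greedy_sum_vector_select(
    query: Vector, candidates: Sequence[Vector], k: int
) -> List[int]:
    """Pick k candidate indices greedily (assumed heuristic, not the paper's code).

    Each step adds the candidate whose inclusion makes the running sum
    vector most aligned with the query. No trade-off parameter is needed.
    """
    if not 0 < k <= len(candidates):
        raise ValueError("k must be between 1 and the number of candidates")

    selected: List[int] = []
    running_sum = [0.0] * len(query)
    remaining = set(range(len(candidates)))

    for _ in range(k):
        best_idx, best_score = -1, -math.inf
        for i in sorted(remaining):  # sorted for deterministic tie-breaking
            trial = [s + c for s, c in zip(running_sum, candidates[i])]
            score = cosine(trial, query)
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        remaining.remove(best_idx)
        running_sum = [s + c for s, c in zip(running_sum, candidates[best_idx])]

    return selected


if __name__ == "__main__":
    query = [1.0, 0.0]
    candidates = [
        [0.8, 0.6],    # most similar to the query
        [0.78, 0.62],  # near-duplicate of the first candidate
        [0.6, -0.8],   # less similar, but deviates in the opposite direction
    ]
    print(greedy_sum_vector_select(query, candidates, k=2))  # prints [0, 2]
```

In this toy case, ranking by similarity alone would return the two near-duplicates (indices 0 and 1). The sum-vector criterion instead pairs the first candidate with the third, because their deviations from the query cancel and the sum lies closer to the query direction. This is the sense in which a single objective can account for both similarity and diversity.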

Supplementary Material: zip

Primary Area: foundation or frontier models, including LLMs

Submission Number: 4967
