MUSS: Multilevel Subset Selection for Relevance and Diversity

Vu Nguyen; Andrey Kan

18 Sept 2025 (modified: 22 Nov 2025) | ICLR 2026 Conference Withdrawn Submission | CC BY 4.0

Keywords: diversity quality selection, distributed setting

TL;DR: efficient diversity and quality data selection

Abstract: The problem of relevant and diverse subset selection has a wide range of applications, including recommender systems and retrieval-augmented generation (RAG). For example, in recommender systems, one is interested in selecting relevant items while providing a diversified recommendation. The constrained subset selection problem is NP-hard, and popular approaches such as Maximum Marginal Relevance (MMR) are based on greedy selection. Many real-world applications involve large data, but the original MMR work did not consider distributed selection. This limitation was later addressed by a method called DGDS, which allows for a distributed setting using random data partitioning. Here, we exploit structure in the data to further improve both scalability and performance on the target application. We propose MUSS, an efficient method that uses a multilevel approach to relevant and diverse selection. In a recommender system application, our method not only improves precision by up to $4$ percentage points, but is also $20$ to $80$ times faster. Our method is also capable of outperforming baselines on RAG-based question answering accuracy. We present a novel theoretical approach for analyzing this type of problem, and show that our method achieves a constant-factor approximation of the optimal objective. Moreover, our analysis also results in a $2\times$ tighter bound for DGDS compared to the previously known bound.
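
To illustrate the greedy selection that the abstract attributes to MMR-style approaches, here is a minimal sketch in Python. It follows the classic MMR scoring rule (a weighted relevance term minus the maximum similarity to items already selected) and is not the MUSS or DGDS algorithm, whose details are not given on this page. The function name `mmr_select`, the trade-off parameter `lam`, and the toy data are assumptions made for illustration.

```python
def mmr_select(relevance, similarity, k, lam=0.5):
    """Greedy MMR-style selection (illustrative sketch, not the paper's method).

    relevance:  list of relevance scores, one per item.
    similarity: square matrix (list of lists) of pairwise item similarities.
    k:          number of items to select.
    lam:        hypothetical trade-off weight; 1.0 means relevance only.
    """
    selected = []
    remaining = set(range(len(relevance)))
    while remaining and len(selected) < k:
        best, best_score = None, float("-inf")
        # Iterate in index order so ties are broken deterministically.
        for i in sorted(remaining):
            # Redundancy is the highest similarity to any already selected item.
            max_sim = max((similarity[i][j] for j in selected), default=0.0)
            score = lam * relevance[i] - (1.0 - lam) * max_sim
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (assumed): items 0 and 1 are near-duplicates, item 2 is different.
relevance = [0.9, 0.85, 0.3]
similarity = [
    [1.0, 0.95, 0.1],
    [0.95, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]

diverse = mmr_select(relevance, similarity, k=2, lam=0.5)
relevance_only = mmr_select(relevance, similarity, k=2, lam=1.0)
```

With `lam=0.5` the second pick skips the near-duplicate item 1 in favor of the less relevant but dissimilar item 2, while `lam=1.0` reduces to ranking by relevance alone. Each greedy step scans all remaining candidates, which is the cost that distributed and multilevel variants aim to reduce on large data.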

Supplementary Material: zip

Primary Area: other topics in machine learning (i.e., none of the above)

Submission Number: 10560
