Track: Main Track
Keywords: diversity sampling, principal component analysis, embeddings
TL;DR: We present and compare approaches for picking an ordered list of diverse samples from a data set containing text.
Abstract: The goal of diversity sampling is to select a representative subset of data that maximizes the information contained in the subset while keeping its cardinality small. We introduce the ordered diversity sampling problem and present a novel and simple approach for generating ordered diverse samples from textual data, based on the principal components of the embedding vectors. We compare our approach with existing approaches using a new metric that measures diversity in an ordered list of samples. We transform standard text classification benchmarks into benchmarks for ordered diversity sampling and show that prevailing approaches perform 6% to 61% worse than our method while also being less time-efficient.
Submission Number: 60
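
The abstract does not spell out how principal components are turned into an ordered sample list, so the following is only a minimal illustrative sketch of one plausible reading, not the authors' method. It assumes the embeddings are already computed as a numeric matrix, and it uses a hypothetical selection rule: walk the principal components in order of decreasing variance and, for each one, pick the not-yet-selected sample with the largest absolute projection. The function name `ordered_diverse_indices` is invented for this example.

```python
import numpy as np


def ordered_diverse_indices(embeddings, k):
    """Return k distinct row indices, ordered by principal-component rank.

    Hypothetical selection rule (an assumption, not taken from the paper):
    for the i-th pick, take the unselected sample whose projection onto
    the i-th principal component has the largest absolute value.
    """
    X = np.asarray(embeddings, dtype=float)
    n = X.shape[0]
    if not 0 <= k <= n:
        raise ValueError("k must be between 0 and the number of samples")

    # Center the data; PCA directions are the right singular vectors.
    centered = X - X.mean(axis=0)
    # Rows of vt are sorted by decreasing singular value (explained variance).
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    scores = centered @ vt.T  # projection of every sample on every component

    chosen = []
    taken = np.zeros(n, dtype=bool)
    comp = 0
    while len(chosen) < k:
        # Cycle back to the first component if k exceeds the component count.
        col = np.abs(scores[:, comp % scores.shape[1]])
        col = np.where(taken, -np.inf, col)  # never pick a sample twice
        idx = int(np.argmax(col))
        chosen.append(idx)
        taken[idx] = True
        comp += 1
    return chosen
```

Because the output is an ordered list, any prefix of it is itself a smaller sample, which is the property the ordered variant of the problem asks for. Under this sketch the cost is dominated by a single SVD of the embedding matrix.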