Uncertainty-Diversity Ranking Coreset Selection for Efficient Spam Detection

Published: 24 Nov 2025, Last Modified: 24 Nov 2025
Venue: 5th Muslims in ML Workshop co-located with NeurIPS 2025
License: CC BY 4.0
Keywords: Coreset Selection, Uncertainty-Aware Decision-Making, Data Efficiency, Transformer Models, Optimization, Spam Detection
Abstract: Efficient spam detection in resource-constrained environments remains challenging due to class imbalance, noisy text, and the computational demands of large Transformer models. We introduce a novel coreset selection framework based on a unified Uncertainty-Diversity Ranking (UDR), which explicitly combines predictive uncertainty with representativeness to prioritize highly informative samples while ensuring diversity and class balance. Our method supports multiple coreset strategies, including Top-K, Bottom-K, and adaptive class-wise selection, enabling robust performance even with a fraction of the training data. Extensive experiments on benchmark datasets, including UCI SMS, UTKML Twitter, and Ling-Spam, show that UDR maintains or improves accuracy, precision, and recall while reducing training data by up to 95%, significantly lowering computational cost. These results demonstrate the potential of UDR in resource-limited settings.
Track: Track 2: ML by Muslim Authors
Submission Number: 35
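
Illustrative sketch: the abstract describes UDR as a ranking that combines predictive uncertainty with representativeness, followed by Top-K, Bottom-K, or class-wise selection, but it does not give the scoring formula. The Python sketch below is therefore only one plausible reading, not the authors' implementation. Everything in it is an assumption made for illustration: the function names `udr_scores` and `select_coreset`, the use of predictive entropy as the uncertainty term, cosine similarity to the class centroid as the representativeness term, the mixing weight `alpha`, and the per-class `fraction` budget.

```python
import numpy as np


def _minmax(x):
    # Rescale to [0, 1]; a constant vector maps to all zeros.
    span = x.max() - x.min()
    return (x - x.min()) / span if span > 0 else np.zeros_like(x)


def udr_scores(probs, embeddings, labels, alpha=0.5):
    """Hypothetical combined score: alpha * uncertainty + (1 - alpha) * representativeness."""
    # Assumed uncertainty term: entropy of the predicted class distribution.
    p = np.clip(probs, 1e-12, 1.0)
    uncertainty = -(p * np.log(p)).sum(axis=1)

    # Assumed representativeness term: cosine similarity to the sample's class centroid.
    norms = np.maximum(np.linalg.norm(embeddings, axis=1, keepdims=True), 1e-12)
    unit = embeddings / norms
    rep = np.zeros(len(labels))
    for c in np.unique(labels):
        mask = labels == c
        centroid = unit[mask].mean(axis=0)
        centroid = centroid / max(np.linalg.norm(centroid), 1e-12)
        rep[mask] = unit[mask] @ centroid

    return alpha * _minmax(uncertainty) + (1 - alpha) * _minmax(rep)


def select_coreset(scores, labels, fraction, mode="top"):
    """Class-wise selection: keep the same fraction of every class (at least one sample)."""
    chosen = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        k = max(1, int(round(fraction * len(idx))))
        order = idx[np.argsort(scores[idx])]  # ascending by score
        chosen.append(order[-k:] if mode == "top" else order[:k])
    return np.sort(np.concatenate(chosen))


# Toy usage with random stand-ins for model outputs (80 ham, 20 spam).
rng = np.random.default_rng(0)
labels = np.array([0] * 80 + [1] * 20)
probs = rng.dirichlet([1.0, 1.0], size=100)
embeddings = rng.normal(size=(100, 8))

scores = udr_scores(probs, embeddings, labels)
top_idx = select_coreset(scores, labels, fraction=0.05, mode="top")
bottom_idx = select_coreset(scores, labels, fraction=0.05, mode="bottom")
```

Selecting per class rather than globally is what keeps the minority (spam) class represented at small budgets; with `fraction=0.05` the sketch keeps 5% of each class, which corresponds to the 95% reduction quoted in the abstract.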