Optimized Statistical Ranking is All You Need for Robust Coreset Selection in Efficient Transformer-Based Spam Detection
Keywords: spam detection, coreset selection, optimization, class imbalance, computational efficiency
Abstract: Spam detection, particularly in resource-constrained environments, remains challenging due to class imbalance, noisy text, and large-scale data requirements. Transformer-based models achieve state-of-the-art performance on text classification tasks; however, their reliance on large datasets makes training computationally expensive and often impractical for real-world deployment. To address this, we introduce a novel coreset selection strategy for efficient spam detection, built on a unified Uncertainty-Diversity Ranking (UDR) framework.
Our method combines uncertainty-based entropy measures with diversity-driven techniques, so that high-uncertainty samples are prioritized for training while diversity is preserved within the selected coreset. The approach supports multiple selection strategies, including Top-K, Bottom-K, and adaptive schemes, making it flexible across use cases. Moreover, it implicitly mitigates class imbalance by balancing uncertain samples across classes, ensuring that minority classes are adequately represented.
We evaluate the effectiveness of our approach on several benchmark spam detection datasets, including UCI SMS, UTKML Twitter, and LingSpam. Experimental results show that our method achieves competitive performance in terms of accuracy, precision, and recall, while significantly reducing the size of the training data. This results in faster training times and lower computational costs, making our approach particularly suitable for mobile devices, low-power communication systems, and other resource-limited environments.
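The abstract describes the UDR framework only at a high level. As a rough, illustrative sketch of how an entropy-based uncertainty score might be combined with a greedy diversity criterion for Top-K coreset selection: the function names, the `alpha` trade-off weight, and the farthest-point diversity term below are assumptions for illustration, not the authors' actual implementation.

```python
import numpy as np

def predictive_entropy(probs):
    """Per-sample predictive entropy; higher values mean more uncertainty."""
    eps = 1e-12
    return -np.sum(probs * np.log(probs + eps), axis=1)

def udr_coreset(probs, embeddings, k, alpha=0.5):
    """Greedy uncertainty-diversity Top-K selection (hypothetical sketch).

    probs      : (n, c) model class probabilities per sample
    embeddings : (n, d) feature embeddings per sample
    k          : target coreset size
    alpha      : trade-off between uncertainty (1.0) and diversity (0.0)
    """
    unc = predictive_entropy(probs)
    unc = (unc - unc.min()) / (np.ptp(unc) + 1e-12)  # normalize to [0, 1]

    # Seed with the single most uncertain sample.
    selected = [int(np.argmax(unc))]
    # Distance from every sample to its nearest already-selected sample.
    min_dist = np.linalg.norm(embeddings - embeddings[selected[0]], axis=1)

    while len(selected) < k:
        diversity = min_dist / (min_dist.max() + 1e-12)
        score = alpha * unc + (1 - alpha) * diversity
        score[selected] = -np.inf          # never re-pick a selected sample
        nxt = int(np.argmax(score))
        selected.append(nxt)
        # Update nearest-selected distances with the new pick.
        min_dist = np.minimum(
            min_dist, np.linalg.norm(embeddings - embeddings[nxt], axis=1)
        )
    return np.array(selected)
```

In this sketch, `alpha` interpolates between a pure uncertainty ranking and a pure farthest-point diversity sweep; a per-class quota on `selected` would be one simple way to realize the class-balancing behavior the abstract mentions.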
Submission Number: 128