Keywords: Coreset Selection, Uncertainty-Aware Decision-Making, Data Efficiency, Transformer Models, Optimization, Spam Detection
Abstract: \textbf{Motivation.} Spam detection in online communication remains challenging due to severe class imbalance, rapidly evolving spam patterns, and the high computational demands of Transformer-based models. Supervised learning approaches often underperform when spam messages are underrepresented, while training on full datasets incurs significant cost. Existing coreset selection methods typically focus on either uncertainty-based sampling (e.g., entropy, margin, confidence) or diversity-based selection (e.g., clustering, k-center, representativeness). However, these strategies often overlook class imbalance and fail to jointly capture both informativeness and representativeness. This highlights the need for efficient data reduction techniques that maintain predictive accuracy while reducing annotation and training overhead.
\textbf{Method.} We propose a novel coreset selection framework, \emph{Class-Balanced Uncertainty-Density Ranking (CBUDR)}, which simultaneously captures predictive uncertainty, representativeness, and class balance. Each sample is assigned a class-normalized uncertainty score, $U_c(x_i) = \frac{U(x_i)}{\max_{x \in C_c} U(x)}$, mitigating over-prioritization of minority or noisy samples. To ensure geometric coverage, a density score $D(x_i) = 1 - \frac{1}{|N_i|} \sum_{x_j \in N_i} \text{sim}(e_i, e_j)$ is computed, where $N_i$ denotes the $k$-nearest neighbors in embedding space and $\text{sim}$ is cosine similarity; higher scores highlight sparsely populated regions. These components are combined via a convex score, $\text{CBUDR}(x_i) = \alpha \cdot U_c(x_i) + \beta \cdot D(x_i)$ with $\alpha + \beta = 1$, providing a controlled trade-off between exploration (uncertainty) and coverage (representativeness). Samples are then ranked by this score within each class, and the coreset is formed by selecting either the highest-ranked (Top-K) or lowest-ranked (Bottom-K) samples per class.
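The scoring and selection steps above can be sketched in NumPy as follows; the function names, the $k{=}10$ and $\alpha{=}0.5$ defaults, and the 5\% selection fraction in the usage note are illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np

def cbudr_scores(embeddings, uncertainties, labels, k=10, alpha=0.5):
    """Sketch of the CBUDR score: alpha * class-normalized uncertainty
    plus (1 - alpha) * density, per the convex combination in the text."""
    n = len(labels)
    # Class-normalized uncertainty: U_c(x_i) = U(x_i) / max_{x in C_c} U(x)
    u_norm = np.empty(n)
    for c in np.unique(labels):
        mask = labels == c
        u_norm[mask] = uncertainties[mask] / uncertainties[mask].max()
    # Cosine similarity between L2-normalized embeddings
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = e @ e.T
    np.fill_diagonal(sim, -np.inf)  # exclude each point from its own neighborhood
    # D(x_i) = 1 - mean similarity to the k nearest neighbors (highest-sim points)
    knn_sim = np.sort(sim, axis=1)[:, -k:]
    density = 1.0 - knn_sim.mean(axis=1)
    return alpha * u_norm + (1.0 - alpha) * density

def select_coreset(scores, labels, frac=0.05, bottom_k=True):
    """Class-wise Bottom-K (or Top-K) selection of a `frac` fraction per class."""
    idx = []
    for c in np.unique(labels):
        cls = np.where(labels == c)[0]
        m = max(1, int(frac * len(cls)))
        order = np.argsort(scores[cls])            # ascending CBUDR score
        chosen = order[:m] if bottom_k else order[-m:]
        idx.extend(cls[chosen])
    return np.array(idx)
```

Selecting per class (rather than globally) is what enforces class balance: each class contributes its own fraction of samples, so minority spam messages cannot be crowded out by the majority class during ranking.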
\textbf{Results.} We evaluate our approach, CBUDR, against random sampling and conventional uncertainty/diversity strategies across three benchmark datasets for SMS, email, and Twitter spam detection (Table~\ref{tab:combined}). The results consistently show that class-wise Bottom-K with CBUDR achieves near-perfect accuracy and F1-scores ($\geq 99\%$) using only 5\% of the training data, outperforming both random selection and Top-K uncertainty methods. On UtkMl Twitter and LingSpam, CBUDR not only surpasses the full-data baseline but also demonstrates that ``easy yet representative'' samples selected via Bottom-K ranking yield stronger generalization than traditional high-uncertainty examples, highlighting a previously underexplored regime of coreset design. These findings demonstrate that principled data selection can simultaneously improve efficiency and generalization.
\textbf{Impact.} CBUDR enables lightweight spam filtering systems that are computationally efficient and suitable for real-time or resource-constrained environments. By explicitly incorporating class balance, the method promotes equitable treatment of minority spam messages, which are often overlooked in standard selection criteria. Beyond spam detection, CBUDR provides a general framework for uncertainty-aware data reduction, with potential applications in fraud detection, misinformation filtering, and other domains requiring robust learning under imbalance.
Submission Number: 191