An Effective, Efficient, and Scalable Confidence-Based Instance Selection Framework for Transformer-Based Text Classification
Abstract: Transformer-based deep learning is currently the state of the art in many NLP and IR tasks. However, fine-tuning such Transformers for specific tasks, especially in scenarios of ever-expanding data volumes with constant re-training requirements and budget constraints, is costly (computationally and financially) and energy-consuming. In this paper, we focus on Instance Selection (IS): a family of methods that select the most representative documents for training, aiming to maintain (or improve) classification effectiveness while reducing the total training (or fine-tuning) time. We propose E2SC-IS (Effective, Efficient, and Scalable Confidence-Based IS), a two-step framework with a particular focus on Transformers and large datasets. E2SC-IS estimates the probability of each instance being removed from the training set using scalable, fast, and calibrated weak classifiers, and exploits an iterative heuristic to estimate a near-optimal reduction rate. Our solution reduces the training sets by 29% on average while maintaining effectiveness on all datasets, with speedups of up to 70%, and it scales to very large datasets (something the baselines cannot do).
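The abstract describes the two steps of E2SC-IS only at a high level. The sketch below is a minimal, illustrative mock-up of a confidence-based IS pipeline in the same spirit, not the authors' implementation: the calibrated logistic-regression weak classifier, the 20 Newsgroups dataset, the validation-based search over candidate reduction rates, and all names and parameters (removal_probabilities, select_reduction_rate, _fit_eval, rates, tol) are assumptions made for this sketch.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split


def removal_probabilities(X_train, y_train):
    """Step 1 (assumed form): fit a fast, calibrated weak classifier and
    score each training instance by the calibrated confidence in its gold
    label. High-confidence ("easy") instances are removal candidates."""
    weak = CalibratedClassifierCV(LogisticRegression(max_iter=1000), cv=3)
    weak.fit(X_train, y_train)
    proba = weak.predict_proba(X_train)
    # confidence the weak model assigns to each instance's true label
    return proba[np.arange(len(y_train)), y_train]


def _fit_eval(X_tr, y_tr, X_val, y_val):
    """Train a cheap proxy classifier and report validation macro-F1."""
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return f1_score(y_val, clf.predict(X_val), average="macro")


def select_reduction_rate(X_train, y_train, X_val, y_val,
                          rates=(0.1, 0.2, 0.3, 0.4, 0.5), tol=0.01):
    """Step 2 (assumed iterative heuristic): try increasing reduction
    rates and keep the largest one whose validation macro-F1 stays
    within `tol` of training on the full set."""
    scores = removal_probabilities(X_train, y_train)
    order = np.argsort(-scores)  # most confident instances removed first
    full_f1 = _fit_eval(X_train, y_train, X_val, y_val)

    best_rate, best_keep = 0.0, np.arange(len(y_train))
    for rate in rates:
        drop = order[: int(rate * len(y_train))]
        keep = np.setdiff1d(np.arange(len(y_train)), drop)
        f1 = _fit_eval(X_train[keep], y_train[keep], X_val, y_val)
        if f1 >= full_f1 - tol:
            best_rate, best_keep = rate, keep
        else:
            break  # effectiveness degraded: stop enlarging the reduction
    return best_rate, best_keep


if __name__ == "__main__":
    data = fetch_20newsgroups(subset="train", remove=("headers", "footers"))
    X = TfidfVectorizer(max_features=20000).fit_transform(data.data)
    y = np.array(data.target)
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2,
                                                random_state=42)

    rate, keep = select_reduction_rate(X_tr, y_tr, X_val, y_val)
    print(f"chosen reduction rate: {rate:.0%}, "
          f"kept {len(keep)} of {len(y_tr)} instances")
    # (X_tr[keep], y_tr[keep]) would then be used to fine-tune the
    # Transformer, cutting training time roughly in proportion
```

A design note on the sketch: scoring with a weak classifier keeps the selection step cheap relative to Transformer fine-tuning, which is the point of the framework; the stopping rule above is one plausible reading of "iterative heuristic" and may differ from the paper's actual criterion.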
Artifact Type Made Available By Authors: Code, Dataset
Requested Badges: Artifacts Evaluated – Functional, Artifacts Evaluated – Reusable, Artifacts Available
Venue Accepted: ACM SIGIR