Abstract: Active learning (AL) aims to select highly informative data points from an unlabeled dataset for annotation, mitigating the need for extensive human labeling effort. However, classical AL methods rely heavily on human expertise to design the sampling strategy, which limits their scalability and generalizability. Many efforts have sought to address this limitation by directly connecting sample selection with model performance improvement, typically through influence functions. Nevertheless, these approaches often ignore the dynamic nature of model behavior during training optimization, despite empirical evidence highlighting the importance of dynamic influence for tracking sample contribution. This oversight can lead to suboptimal selection and hinder model generalizability. In this study, we explore a dynamic-influence-based data selection strategy that traces the impact of unlabeled instances on model performance throughout the training process. Our theoretical analysis suggests that selecting samples with higher projected gradients along the accumulated optimization direction at each checkpoint leads to improved performance. Furthermore, to capture a wider range of training dynamics without incurring excessive computational or memory costs, we introduce an additional dynamic loss term designed to encapsulate more generalized training progress information. These insights are integrated into a universal, task-agnostic AL framework termed Dynamic Influence Scoring for Active Learning (DISAL). Comprehensive experiments across various tasks show that DISAL significantly surpasses existing state-of-the-art AL methods, demonstrating its ability to facilitate more efficient and effective learning across different domains.
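The selection criterion summarized above can be illustrated with a minimal sketch: at a given checkpoint, each unlabeled candidate is scored by projecting its gradient onto the accumulated optimization direction, and the highest-scoring samples are queried for labeling. The snippet below is an illustrative sketch, not the authors' implementation; the names (`score_candidates`, `theta_prev`, `theta_curr`, `pseudo_y`) and the use of model pseudo-labels to compute gradients on unlabeled data are assumptions.

```python
# Illustrative sketch (assumed names, not the DISAL reference code):
# score each unlabeled sample by the projection of its gradient onto the
# accumulated optimization direction between two training checkpoints.
import torch


def score_candidates(model, loss_fn, unlabeled_loader, theta_prev, theta_curr):
    """Return one projected-gradient score per unlabeled batch element.

    theta_prev / theta_curr: flattened parameter vectors at two checkpoints;
    their difference approximates the accumulated optimization direction.
    """
    direction = theta_curr - theta_prev                      # accumulated update direction
    direction = direction / (direction.norm() + 1e-12)       # normalize for a pure projection

    scores = []
    for x, pseudo_y in unlabeled_loader:                     # pseudo-labels are an assumption here
        model.zero_grad()
        loss = loss_fn(model(x), pseudo_y)
        loss.backward()
        g = torch.cat([p.grad.flatten() for p in model.parameters()
                       if p.grad is not None])                # flattened sample gradient
        scores.append(torch.dot(g, direction).item())         # projection onto the direction
    return scores  # higher score -> larger expected contribution; select top-k for labeling
```

Under this reading, samples whose gradients align most strongly with the direction the model is already moving in are expected to contribute most to further performance gains, which is the intuition the abstract attributes to the theoretical analysis.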