Stochastic Featurization for Active Learning

Linh Le; Minh-Tien Nguyen; Khai Phan Tran; Genghong Zhao; Xia Zhang; Guido Zuccon; Gianluca Demartini

Published: 01 Jan 2024, Last Modified: 15 May 2025. Venue: TAI4H 2024. License: CC BY-SA 4.0

Abstract: In recent years, the demand for high-quality data has intensified, particularly in the medical field where accurate data annotation is costly and critical. Active Learning (AL) has emerged as a pivotal approach in these scenarios, where selecting high-quality data for training machine learning models is essential. This paper introduces a novel method, “Stochastic Featurization for Active-learning” (SFAL), designed to efficiently identify hard-to-classify unlabeled data within both medical and general datasets. Unlike traditional AL methods that rely on a pre-trained estimator, SFAL extracts novelty features from the latent representations of a target model, thereby circumventing the need for extensive initial training and facilitating the selection of a diverse array of challenging medical data samples. This technique is particularly effective in the context of medical text classification and named entity recognition, areas where precise data interpretation is crucial. Our extensive testing across seven benchmark datasets, including those specific to clinical settings, confirms that SFAL surpasses existing state-of-the-art AL methods in performance, demonstrating its significant potential for advancing medical data analysis.
