A Hybrid Active Learning Regression Approach for Accelerating Annotation with Data Generation Constraints

TMLR Paper4997 Authors

30 May 2025 (modified: 30 May 2025). Under review for TMLR. License: CC BY 4.0
Abstract: In many scientific settings, experimental samples are organized into multiple data groups according to their underlying structures, e.g., 1000 samples per group, where the samples within a group share certain similarities but differ by systematic physicochemical variations. A small number of samples (e.g., 10) are then selected and placed in a parallel synthesizer, a lengthy process, to measure their properties for subsequent machine learning analysis. Active learning, which selects the samples most informative to the model, can reduce the cost of this procedure by achieving better model performance with fewer labelled samples. However, generic batch-mode active learning algorithms are designed to sample from a single pool and therefore lack a mechanism for accelerating concurrent experiments across multiple data groups. This paper proposes an active learning approach for scientific data with inherent group information. It uses multiple-output quantile regression for uncertainty estimation and combines that uncertainty with the diversity of the data distribution in a hybrid query method. The proposed method improves the efficiency of concurrent experiments, and experimental results demonstrate its effectiveness on a suite of materials science tasks.
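To make the idea of a group-aware hybrid query concrete, the following is a minimal sketch and not the authors' algorithm. It assumes that lower and upper quantile predictions are already available for every unlabelled candidate, uses the width of that interval as the uncertainty signal, and uses Euclidean distance to already-chosen samples as the diversity signal. The function name `hybrid_select`, the weight `alpha`, and the per-group budget `per_group` are hypothetical names introduced only for illustration.

```python
import numpy as np

def hybrid_select(X, q_low, q_high, groups, per_group=2, alpha=0.5):
    """Greedily pick `per_group` candidates from every group.

    X       : (n, d) features of the unlabelled candidates
    q_low   : (n,) predicted lower quantile per candidate
    q_high  : (n,) predicted upper quantile per candidate
    groups  : (n,) group label per candidate
    alpha   : weight on uncertainty vs. diversity (assumed knob)
    """
    X = np.asarray(X, dtype=float)
    width = np.asarray(q_high, dtype=float) - np.asarray(q_low, dtype=float)
    groups = np.asarray(groups)
    selected = []
    for g in np.unique(groups):
        idx = np.flatnonzero(groups == g)
        # Uncertainty: quantile interval width, scaled to [0, 1] in the group.
        u = width[idx]
        u = u / u.max() if u.max() > 0 else np.zeros_like(u)
        chosen = []
        # Distance from each candidate to its nearest already-chosen sample.
        min_dist = np.full(len(idx), np.inf)
        for _ in range(min(per_group, len(idx))):
            if chosen and min_dist.max() > 0:
                d = min_dist / min_dist.max()
            else:
                d = np.zeros(len(idx))  # no diversity signal yet
            score = alpha * u + (1.0 - alpha) * d
            if chosen:
                score[chosen] = -np.inf  # never pick the same sample twice
            best = int(np.argmax(score))
            chosen.append(best)
            dist = np.linalg.norm(X[idx] - X[idx[best]], axis=1)
            min_dist = np.minimum(min_dist, dist)
        selected.extend(idx[chosen].tolist())
    return selected
```

Selecting a fixed number of samples from every group, rather than from one merged pool, is what lets each round fill a parallel experiment with candidates from all groups at once. Setting `alpha=1.0` reduces the sketch to pure uncertainty sampling within each group.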
Submission Length: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Hsuan-Tien_Lin1
Submission Number: 4997