A Hybrid Active Learning Regression Approach for Accelerating Annotation with Data Generation Constraints
Abstract: In many scientific settings, experimental samples are organized into multiple data groups according to their underlying structure (e.g., 1000 samples per group); samples within a group share certain similarities but exhibit systematic physicochemical variations. A small number of samples (e.g., 10) is then selected and placed in a parallel synthesizer, where a lengthy synthesis process yields their properties for subsequent machine learning analysis. Active learning, which selects the most informative samples for the model, can reduce the cost of this procedure by achieving better model performance with fewer labelled samples. However, generic batch-mode active learning algorithms are designed to sample from a single pool and therefore lack a mechanism for accelerating concurrent experiments across multiple data groups. This paper proposes an active learning approach for scientific data with inherent group information; it uses multiple-output quantile regression for uncertainty estimation and combines this with the diversity of the data distribution in a hybrid query method. The proposed method improves the efficiency of concurrent experiments, and experimental results demonstrate its effectiveness on a suite of materials science tasks.
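To make the idea of a hybrid query concrete, the following is a minimal sketch, not the paper's algorithm. It assumes that the width between a lower and an upper predicted quantile serves as the uncertainty signal, that diversity is measured as Euclidean distance to the nearest already-selected sample, and that the two are mixed by a fixed weight. The function names `select_batch` and `select_per_group` and the `weight` parameter are hypothetical and do not come from the submission.

```python
import numpy as np


def select_batch(X, q_low, q_high, k, weight=0.5):
    """Greedily pick k sample indices from one group using a hybrid score.

    Assumptions (illustrative only): uncertainty is the predicted quantile
    interval width, diversity is the distance to the nearest selected sample.
    """
    X = np.asarray(X, dtype=float)
    width = np.asarray(q_high, dtype=float) - np.asarray(q_low, dtype=float)

    # Scale interval widths to [0, 1] so they are comparable to distances.
    span = width.max() - width.min()
    unc = (width - width.min()) / span if span > 0 else np.zeros_like(width)

    selected = []
    taken = np.zeros(len(X), dtype=bool)
    min_dist = np.full(len(X), np.inf)  # distance to nearest selected sample

    for _ in range(min(k, len(X))):
        if selected and min_dist.max() > 0:
            div = min_dist / min_dist.max()
        else:
            div = np.zeros(len(X))  # no diversity signal before the first pick
        score = (1.0 - weight) * unc + weight * div
        score[taken] = -np.inf  # never pick the same sample twice
        best = int(np.argmax(score))
        selected.append(best)
        taken[best] = True
        min_dist = np.minimum(min_dist, np.linalg.norm(X - X[best], axis=1))
    return selected


def select_per_group(groups, k, weight=0.5):
    """Apply the same selection independently to every data group."""
    return {
        name: select_batch(X, lo, hi, k, weight)
        for name, (X, lo, hi) in groups.items()
    }


# Toy usage: one group of four 2-D samples with made-up quantile predictions.
X = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [10.0, 10.0]]
picks = select_per_group({"group_a": (X, [0, 0, 0, 0], [1.0, 0.9, 0.2, 0.5])}, k=2)
print(picks)
```

Setting `weight=0.0` in this sketch reduces it to pure uncertainty sampling, while larger values favour samples far from those already chosen; how the submission actually balances the two criteria is not stated in the abstract.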
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission:
1. Added several citations in the third paragraph of the Introduction.
2. Added and changed citations in the Related Work.
3. In Related Work: moved the description of the RT-AL series of methods from under the heading "Hybrid AL strategy" to "Other Criteria in AL".
4. Updated some wording.
5. All changes are highlighted in red text.
Assigned Action Editor: ~Hsuan-Tien_Lin1
Submission Number: 4997