Hierarchical Active Learning With Label Proportions on Data Regions

Zhipeng Luo; Qiang Gao; Yazhou He; Hongjun Wang; Milos Hauskrecht; Tianrui Li

Published: 01 Jan 2024, Last Modified: 07 Dec 2024. Venue: IEEE Trans. Knowl. Data Eng. 2024. License: CC BY-SA 4.0

Abstract: Learning classification models from real-world data often requires substantial human effort devoted to instance annotation. Because instance-based annotation can be very time-consuming and costly, we propose a novel active learning framework that builds classification models from human-annotated regions. A region is defined by a set of conjunctive patterns formed by value ranges over the input features. A region label is a human assessment of the class proportion in the data population covered by the region. By leveraging learning-from-label-proportions algorithms, regions and their class proportions can be used to train instance-based classification models. The key challenge, however, is that in practice very few regions are defined in advance. Therefore, to identify regions important for model learning, we design a hierarchical active learning (HAL) framework, which actively builds a hierarchy of regions. Similar to the decision-tree learning process, our approach progressively divides the input data space into smaller sub-regions, solicits labels for the new regions, and retrains the base classification model with all the leaf regions. We further develop a multi-hierarchy (forest) solution, which builds multiple shallower hierarchies whose regions are more informative, more diverse, and simpler. We evaluate our HAL framework on numerous impactful classification datasets as well as in a real user study on the survival analysis of colorectal cancer patients. The results demonstrate that region-based active learning methods can learn high-quality classifiers from very few labeled regions. Hence, our framework is shown to be very effective in reducing the human annotation effort needed for building classification models.
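To make the region idea concrete, the following is a minimal Python sketch of a region as a conjunction of feature ranges and of a decision-tree-like loop that splits leaf regions and requests a class proportion for each new sub-region. It is an illustration under stated assumptions, not the authors' implementation: the names `Region`, `simulated_proportion`, and `grow_hierarchy` are hypothetical, the annotator is simulated from hidden labels, the rule for choosing which leaf to split (proportion closest to 0.5) and where to split (median of a cycling feature) is a placeholder, and the retraining of the base classifier from label proportions is omitted.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class Region:
    # Conjunctive pattern: feature index -> (low, high), meaning low <= x[f] < high.
    bounds: dict = field(default_factory=dict)
    # Region label: the annotator's estimate of the positive-class proportion.
    proportion: Optional[float] = None

    def covers(self, x):
        """True if x satisfies every range in the conjunction."""
        return all(lo <= x[f] < hi for f, (lo, hi) in self.bounds.items())

    def split(self, feature, threshold):
        """Divide this region into two sub-regions along one feature."""
        lo, hi = self.bounds.get(feature, (float("-inf"), float("inf")))
        left = Region({**self.bounds, feature: (lo, threshold)})
        right = Region({**self.bounds, feature: (threshold, hi)})
        return left, right


def simulated_proportion(region, X, y):
    """Stand-in for the human annotator (assumption): share of positives covered."""
    labels = [yi for xi, yi in zip(X, y) if region.covers(xi)]
    return sum(labels) / len(labels) if labels else None


def grow_hierarchy(X, y, n_splits):
    """Grow one hierarchy and return its leaf regions with their proportions."""
    root = Region()
    root.proportion = simulated_proportion(root, X, y)
    leaves = [root]
    for step in range(n_splits):
        # Placeholder heuristic: split the leaf whose proportion is most mixed.
        candidates = [r for r in leaves if r.proportion is not None]
        target = min(candidates, key=lambda r: abs(r.proportion - 0.5))
        # Placeholder split rule: cycle through features, cut at the median.
        feature = step % len(X[0])
        values = sorted(x[feature] for x in X if target.covers(x))
        threshold = values[len(values) // 2]
        children = target.split(feature, threshold)
        for child in children:
            child.proportion = simulated_proportion(child, X, y)
        leaves = [r for r in leaves if r is not target] + list(children)
    return leaves


if __name__ == "__main__":
    X = [(0.1, 0.2), (0.2, 0.8), (0.8, 0.3), (0.9, 0.9)]
    y = [0, 0, 1, 1]
    for leaf in grow_hierarchy(X, y, n_splits=1):
        print(leaf.bounds, leaf.proportion)
```

In a full pipeline, the leaf regions and their proportions returned by `grow_hierarchy` would be passed to a learning-from-label-proportions method after each split to retrain the instance-level classifier, and a forest variant would repeat the procedure to build several shallower hierarchies.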
