Abstract: Practitioners from many disciplines (e.g., political science) use expert-crafted taxonomies
to make sense of large, unlabeled corpora. In
this work, we study Seeded Hierarchical Clustering (SHC): the task of automatically fitting
unlabeled data to such taxonomies using a
small set of labeled examples. We propose HIERSEED, a novel weakly supervised algorithm
for this task that uses only a small set of labeled seed examples in a computation- and
data-efficient manner. HIERSEED assigns documents to topics by weighing document density
against topic hierarchical structure. It outperforms unsupervised and supervised baselines
for the SHC task on three real-world datasets.