Document Classification using Reference Information

TMLR Paper4541 Authors

23 Mar 2025 (modified: 29 Mar 2025), Under review for TMLR, CC BY 4.0
Abstract: Document classification is a common problem when organizing unstructured text data, but supervised classification algorithms often require many labeled examples and tedious manual annotation by human labelers. This work proposes a methodology called Document Classification using Reference Information (DCRI), which classifies documents with little human intervention by leveraging reference information, that is, documents from external sources related to the label classes of interest. For example, when classifying news articles into topics, Wikipedia articles can serve as such an external source. DCRI uses reference information to generate weak initial labels for an unlabeled corpus, then iteratively refines them into stronger labels using supervised machine learning algorithms and, if available, limited human labeling capacity. DCRI is evaluated on one dataset from a major pharmaceutical manufacturing company and two public datasets for news topic classification. When no human labeling capacity is available, DCRI achieves an accuracy between 84% and 96% on these three datasets. When some manual labeling capacity is available, DCRI helps prioritize documents with high uncertainty for labeling. To shed light on the value of reference information, this paper also develops a generative mathematical model in which reference information provides a noisy estimate of the latent distribution that generates documents. An extensive numerical study is performed using synthetic data to analyze when and why reference information is most valuable. Finally, for a special case of the model with two classes, a theoretical result is established to show the value of the iterative nature of the DCRI approach.
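The loop described in the abstract (weak labels from reference documents, iterative refinement, and prioritizing uncertain documents for human review) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the document representation, the similarity measure, or the classifier, so the bag-of-words vectors, cosine similarity, centroid classifier, and all names below (`dcri_sketch`, `classify`, `review_order`) are assumptions chosen only to keep the example self-contained.

```python
from collections import Counter
import math


def bag_of_words(text):
    """Lowercased token counts; an assumed stand-in for any document representation."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(vec, centroids):
    """Return (best_label, margin); a small margin means high uncertainty."""
    scores = sorted(((cosine(vec, c), label) for label, c in centroids.items()),
                    reverse=True)
    best_score, best_label = scores[0]
    margin = best_score - scores[1][0] if len(scores) > 1 else best_score
    return best_label, margin


def dcri_sketch(documents, reference, n_iter=3):
    """Hypothetical sketch: weak labels from reference text, then iterative refits."""
    vecs = [bag_of_words(d) for d in documents]
    # Step 1: weak initial labels from similarity to each class's reference text.
    centroids = {label: bag_of_words(text) for label, text in reference.items()}
    results = [classify(v, centroids) for v in vecs]
    for _ in range(n_iter):
        # Step 2: refit class centroids on the current labels (reference text kept in).
        centroids = {label: bag_of_words(text) for label, text in reference.items()}
        for v, (label, _) in zip(vecs, results):
            centroids[label].update(v)
        # Step 3: relabel the corpus with the refitted model.
        results = [classify(v, centroids) for v in vecs]
    labels = [label for label, _ in results]
    # Documents with the smallest margin come first in the human review queue.
    review_order = sorted(range(len(documents)), key=lambda i: results[i][1])
    return labels, review_order


# Toy data, invented for illustration only.
reference = {
    "sports": "football match team goal player league",
    "finance": "stock market bank shares investor profit",
}
documents = [
    "the team scored a late goal in the match",
    "shares fell as the bank reported lower profit",
    "the player joined a new club",
    "investor confidence in the market rose",
]
labels, review_order = dcri_sketch(documents, reference)
print(labels)
print(review_order)
```

In this sketch, `review_order` lists document indices from least to most confident, so a limited human labeling budget would be spent on the first entries. Any real use would replace the centroid classifier with a proper supervised model.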
Submission Length: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Gustavo_Carneiro1
Submission Number: 4541