Extreme Multi-label Text Classification with Pseudo Label Descriptions


16 Nov 2021 (modified: 05 May 2023) | ACL ARR 2021 November Blind Submission | Readers: Everyone
Abstract: Extreme multi-label text classification (XMTC) is the task of tagging each document with the relevant labels from a large predefined label space, where the label frequency distribution is often highly skewed. That is, a large portion of the labels (namely the tail labels) have very few positive instances, posing a hard optimization problem for training classification models. This severe data sparsity issue with tail labels is more pronounced in recent neural classifiers, where the embeddings of both the input documents and the output labels need to be jointly learned, and the success of such learning relies on the availability of sufficient training instances. This paper addresses this challenge in XMTC by proposing a novel approach that combines the strengths of both traditional bag-of-words (BoW) classifiers and recent neural embedding-based classifiers. Specifically, we use a trained BoW model to generate a pseudo description for each label, and apply a neural model to establish the mapping between input documents and target labels in the latent embedding spaces. Our experiments show significant improvements of the proposed approach over other strong baseline methods on benchmark datasets, especially on tail label prediction. We also provide a theoretical analysis relating BoW and neural models with respect to a performance lower bound.
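The pseudo-description step can be illustrated with a minimal sketch. The abstract does not say which BoW model is trained or how its output is turned into text, so everything below is an assumption made for illustration: a Rocchio-style centroid scorer over TF-IDF weights stands in for the trained BoW classifier, the top-k highest-scoring tokens of a label are taken as its pseudo description, and the names `tfidf_vectors` and `pseudo_descriptions` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tfidf_vectors(docs):
    """Return one {token: tf-idf weight} dict per whitespace-tokenized document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each token.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # Smoothed idf, so tokens present in every document keep a positive weight.
        vectors.append(
            {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        )
    return vectors


def pseudo_descriptions(docs, doc_labels, k=2):
    """Assumed stand-in for the BoW step: score each token per label as
    (mean weight in positive docs) - (mean weight in negative docs)
    and keep the top-k positively scored tokens as the pseudo description."""
    vectors = tfidf_vectors(docs)
    all_labels = set().union(*doc_labels)
    result = {}
    for label in sorted(all_labels):
        pos, neg = defaultdict(float), defaultdict(float)
        n_pos = sum(label in labels for labels in doc_labels)
        n_neg = len(docs) - n_pos
        for vec, labels in zip(vectors, doc_labels):
            target = pos if label in labels else neg
            for tok, weight in vec.items():
                target[tok] += weight
        scores = {
            tok: pos[tok] / n_pos - (neg.get(tok, 0.0) / n_neg if n_neg else 0.0)
            for tok in pos
        }
        # Sort by descending score, breaking ties alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result[label] = [t for t in ranked[:k] if scores[t] > 0]
    return result


# Toy corpus (invented for illustration, not from the paper).
docs = [
    "neural network training gradient",
    "gradient descent optimizer training",
    "protein folding structure",
    "protein structure prediction",
]
doc_labels = [{"ml"}, {"ml"}, {"bio"}, {"bio"}]
descriptions = pseudo_descriptions(docs, doc_labels, k=2)
```

In the pipeline the abstract describes, such token lists would then be encoded by the neural model alongside the input documents, so that a tail label with few positive instances still has some text to be embedded from.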
