Gandalf: Data Augmentation is all you need for Extreme Classification

Anonymous

16 Jul 2022 (modified: 05 May 2023), ACL ARR 2022 July Blind Submission, Readers: Everyone

Abstract: Extreme Multi-label Text Classification (XMC) involves learning a classifier that assigns to an input the subset of most relevant labels from millions of label choices. Recent work in this domain increasingly focuses on the problem setting with (i) short-text input data, and (ii) labels endowed with meta-data in the form of textual descriptions. Short-text XMC with label features has found numerous applications in areas such as prediction of related searches, product recommendation based on titles, and bid-phrase suggestion. In this work, by exploiting the problem characteristics of short-text XMC, we develop postulates stating the desired invariances and propose two data augmentation techniques to achieve them. The first, LabelMix, performs data augmentation by concatenating an annotating label to the data-point. The second, Gandalf, generates additional data-points by treating labels as legitimate data-points. We demonstrate the efficacy of the proposed augmentation methods by showing up to 30% relative improvement when they are applied to a range of existing algorithms, and by proposing an algorithmic framework, InceptionXML-LF, which advances the state of the art on benchmark datasets.
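The two augmentations named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `labelmix` and `labels_as_datapoints`, the toy data, the choice to append one randomly selected annotating label, and the choice to annotate each label-derived point with only its own label are all assumptions made for illustration, since the abstract does not specify these details.

```python
import random

def labelmix(text, label_ids, label_texts, rng, sep=" "):
    """LabelMix-style sketch: append the text of one annotating label.

    Assumption: a single annotating label is picked uniformly at random.
    """
    if not label_ids:
        return text
    chosen = rng.choice(sorted(label_ids))
    return text + sep + label_texts[chosen]

def labels_as_datapoints(label_texts):
    """Gandalf-style sketch: every label description becomes a data point.

    Assumption: each new point is annotated with its own label only.
    """
    return [(label_texts[l], {l}) for l in sorted(label_texts)]

# Hypothetical toy data: label id -> textual description.
label_texts = {0: "running shoes", 1: "trail sneakers", 2: "yoga mat"}
# Training set of (short text, set of relevant label ids).
train = [("lightweight shoes for jogging", {0, 1})]

rng = random.Random(0)
mixed = [(labelmix(t, ys, label_texts, rng), ys) for t, ys in train]
augmented = train + mixed + labels_as_datapoints(label_texts)
```

Here `augmented` holds the original point, one LabelMix-style variant with the same targets, and one extra point per label, which is the kind of enlarged training set an existing XMC model could then be trained on.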

Paper Type: long
