Taming Prompt-Based Data Augmentation for Long-Tailed Extreme Multi-Label Text Classification

Published: 2024 · Last Modified: 22 Jan 2026 · ICASSP 2024 · CC BY-SA 4.0
Abstract: In extreme multi-label text classification (XMC), labels usually follow a long-tailed distribution in which most labels are associated with only a few documents, which limits XMC performance. Data augmentation (DA) is a simple but effective strategy for mitigating such low-resource problems. In this paper, we propose a prompt-based DA method called XDA, specifically designed for XMC. First, we employ a soft prompt while fine-tuning the T5 model for label-conditional DA, enabling T5 to generate augmented samples while preserving label compatibility. XDA then filters the augmented samples based on text diversity and label consistency, improving the quality of the augmented data. In contrast to traditional sample-level DA, we propose a pair-level DA method that masks augmented sample-label pairs of head labels during training, effectively mitigating the long-tailed problem. Comprehensive experiments on benchmark datasets show that the proposed XDA outperforms state-of-the-art counterparts.
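The filtering stage described in the abstract can be illustrated with a minimal sketch. This is a hypothetical reconstruction, not the paper's implementation: `diversity` here is a simple lexical (1 − Jaccard) score, the `consistency_fn` argument stands in for a trained label-consistency classifier, and the thresholds are assumed values.

```python
def diversity(original: str, augmented: str) -> float:
    """Lexical diversity as 1 - Jaccard overlap of token sets (a stand-in
    for whatever diversity measure the paper actually uses)."""
    a, b = set(original.lower().split()), set(augmented.lower().split())
    return 1.0 - len(a & b) / len(a | b) if a | b else 0.0


def filter_augmented(original, candidates, consistency_fn,
                     min_div=0.3, min_cons=0.5):
    """Keep augmented candidates that are both sufficiently different from
    the source text and scored as label-consistent by consistency_fn."""
    return [text for text in candidates
            if diversity(original, text) >= min_div
            and consistency_fn(text) >= min_cons]


# Toy demo: a stub classifier that treats every candidate as consistent.
orig = "the stock market fell sharply today"
cands = [
    "the stock market fell sharply today",           # near-duplicate: rejected
    "equity prices dropped steeply this afternoon",  # diverse paraphrase: kept
]
always_consistent = lambda text: 1.0
print(filter_augmented(orig, cands, always_consistent))
```

In the actual method, `consistency_fn` would be a classifier's confidence that the augmented text still matches the conditioning label; candidates failing either criterion are discarded before training.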