From Text Segmentation to Enhanced Representation Learning: A Novel Approach to Multi-Label Classification for Long Texts

Wang Zhang, Xin Wang, Qian Wang, Tao Deng, Xiaoru Wu

Published: 2024, Last Modified: 11 Feb 2025. EMNLP (Findings) 2024. CC BY-SA 4.0

Abstract: Multi-label text classification (MLTC) is an important task in natural language processing. Most existing models rely on high-quality text representations provided by pre-trained language models (PLMs), and therefore run into the PLMs' input length limit when dealing with long texts. In light of this, we introduce a comprehensive approach to multi-label long text classification. To address the input length limit, we propose a text segmentation algorithm that is guaranteed to produce the optimal segmentation. To enhance both text and label representations, we incorporate external knowledge, label co-occurrence relations, and attention mechanisms into representation learning. Extensive experiments on various MLTC datasets validate the effectiveness of our method and unravel the intricate correlations between texts and labels.
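
The abstract mentions label co-occurrence relations without giving details. As a generic illustration only, and not the authors' implementation, the sketch below shows one common way such relations are estimated from training annotations: count how often labels appear together, then derive a conditional probability P(target | given). The function names `label_cooccurrence` and `conditional_prob` and the toy label sets are assumptions made up for this example.

```python
from collections import Counter
from itertools import combinations


def label_cooccurrence(label_sets):
    """Count single-label frequencies and unordered label-pair frequencies."""
    label_counts = Counter()
    pair_counts = Counter()
    for labels in label_sets:
        unique = sorted(set(labels))  # sort so each pair has one canonical order
        label_counts.update(unique)
        pair_counts.update(combinations(unique, 2))
    return label_counts, pair_counts


def conditional_prob(label_counts, pair_counts, given, target):
    """Estimate P(target | given) as count(given, target) / count(given)."""
    if label_counts[given] == 0:
        return 0.0
    if given == target:
        return 1.0
    pair = tuple(sorted((given, target)))
    return pair_counts[pair] / label_counts[given]


# Hypothetical toy data: each document is annotated with a set of labels.
docs = [
    {"sports", "health"},
    {"sports", "business"},
    {"sports", "health", "business"},
    {"politics"},
]
label_counts, pair_counts = label_cooccurrence(docs)
p_sports_given_health = conditional_prob(label_counts, pair_counts, "health", "sports")
p_health_given_sports = conditional_prob(label_counts, pair_counts, "sports", "health")
```

Note that the estimate is asymmetric: in the toy data every "health" document is also labeled "sports", but not the reverse. Statistics of this kind are often used as edge weights in a label graph; whether the paper does so is not stated in the abstract.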