Hierarchical Text Classification Optimization via Structural Entropy and Singular Smoothing

Published: 2025, Last Modified: 15 Jan 2026. IEEE Trans. Knowl. Data Eng. 2025. License: CC BY-SA 4.0
Abstract: With long-tailed data and a complex label hierarchy, hierarchical text classification (HTC) is a challenging multi-label text classification task. Applying prompts to pre-trained language models (PLMs) has recently become a mainstream approach in HTC. However, existing prompt-based models suffer a significant drop in classification performance on tail labels. Because the data are imbalanced, HTC models still face two challenges. First, the text embeddings learned for classification often lack distinctiveness for tail categories. Second, label embeddings suffer from significant degeneration, especially for tail labels. To address these issues, we propose a novel Hierarchical Text Classification Optimization method via Structural Entropy and SIngular Spectrum Smoothing, namely SIHTC. SIHTC contains two parts: text embedding optimization and label embedding optimization. First, based on structural information theory, we design a tree aggregation network and construct encoding trees to minimize the structural entropy of texts under the hierarchical labels. In this manner, SIHTC injects label structural information into text embeddings, hierarchically optimizing the embedding space by enclosing the text embeddings within related ground truth labels while separating them from unrelated ground truth labels. Second, we propose a global and local singular spectrum smoothing regularization method that maximizes the area under the singular value curve. In this way, SIHTC reduces representation degeneration and learns label embeddings with improved generalization capability. Extensive experiments are conducted on three popular HTC datasets. The results show that SIHTC outperforms all baseline methods, with a particular advantage on tail labels, indicating the effectiveness of the two optimizations.
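As an illustration of the quantity targeted by the second optimization, the following is a minimal sketch of an "area under the singular value curve" score for a label embedding matrix. It is not the SIHTC regularizer: the function name `singular_value_auc`, the normalization by the largest singular value, and the use of a simple mean as the discrete area are all assumptions made here for illustration. The abstract does not specify the exact formulation or how the global and local variants differ.

```python
import numpy as np


def singular_value_auc(embeddings: np.ndarray) -> float:
    """Hypothetical score: mean of singular values normalized by the largest.

    A value near 1 means a flat spectrum (embeddings spread over many
    directions); a value near 1/k (k = number of singular values) means the
    embeddings have collapsed onto a single direction.
    """
    # Singular values are returned sorted in descending order.
    s = np.linalg.svd(embeddings, compute_uv=False)
    if s[0] == 0.0:
        return 0.0  # all-zero matrix: no spectrum to measure
    normalized = s / s[0]
    # Discrete area under the normalized curve, scaled to lie in (0, 1].
    return float(normalized.mean())


# Toy label embeddings (assumed shapes: 4 labels, 4 dimensions).
spread = np.eye(4)           # orthogonal label vectors
collapsed = np.ones((4, 4))  # every label shares one direction

print(singular_value_auc(spread))
print(singular_value_auc(collapsed))
```

Under this assumed definition, a regularizer that increases the score pushes the spectrum away from the collapsed case, which corresponds to the degeneration the abstract describes for tail-label embeddings.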