TRGNN: Text-Rich Graph Neural Network for Few-Shot Document Filtering

Published: 01 Jan 2024, Last Modified: 22 Jul 2025, Venue: IJCNN 2024, License: CC BY-SA 4.0
Abstract: The internet contains a vast amount of textual information, and a crucial step in using it is to filter out irrelevant documents and retain those on topics of interest. However, supervised filtering methods require a large amount of annotated data, which is time-consuming and labor-intensive to produce, and in practice it is infeasible to annotate data covering all text. We therefore propose a joint training framework for few-shot document filtering based on a text-rich graph. Specifically, we construct a text-rich graph by treating unlabeled documents, seed words, and seed document pairs as nodes and their associations as edges. On this graph, we build a network learning module that combines PageRank-based neighborhood sampling with attention mechanisms to obtain graph embeddings. We also construct a text embedding module that encodes the original text as vectors using BERT. Through feature sharing, the text representations from the text embedding module enrich the network learning module with deep semantic features, and through joint training and label merging the framework filters the unlabeled documents. Extensive experiments on two real-world datasets consistently show that our model outperforms existing alternatives, including traditional classification and retrieval baselines for few-shot document filtering. Ablation studies further demonstrate that each component contributes to filtering performance.
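The graph construction and PageRank-based neighborhood sampling described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how edges are defined or how PageRank scores drive sampling, so the sketch assumes a document is linked to a seed word whenever it contains that word, and that the neighbors kept for a node are those with the highest personalized PageRank score. All function names and parameters (`build_text_rich_graph`, `personalized_pagerank`, `top_k_neighbors`, `alpha`, `iters`) are hypothetical, and the attention and BERT components are omitted.

```python
from collections import defaultdict


def build_text_rich_graph(docs, seed_words):
    """Assumed edge rule: link a document node to each seed word it contains."""
    adj = defaultdict(set)
    for doc_id, text in docs.items():
        tokens = set(text.lower().split())
        for word in seed_words:
            if word in tokens:
                adj[("doc", doc_id)].add(("word", word))
                adj[("word", word)].add(("doc", doc_id))
    return adj


def personalized_pagerank(adj, source, alpha=0.15, iters=50):
    """Power iteration for a random walk that restarts at `source` with probability alpha."""
    scores = {source: 1.0}
    for _ in range(iters):
        nxt = defaultdict(float)
        nxt[source] += alpha
        for node, mass in scores.items():
            neighbors = adj.get(node, ())
            if not neighbors:
                # A node without edges sends its remaining mass back to the source.
                nxt[source] += (1 - alpha) * mass
                continue
            share = (1 - alpha) * mass / len(neighbors)
            for nb in neighbors:
                nxt[nb] += share
        scores = dict(nxt)
    return scores


def top_k_neighbors(adj, source, k):
    """Keep the k nodes (other than the source) with the highest score."""
    scores = personalized_pagerank(adj, source)
    ranked = sorted(
        (node for node in scores if node != source),
        key=lambda node: scores[node],
        reverse=True,
    )
    return ranked[:k]
```

Under these assumptions, a document that shares no seed word with the rest of the corpus stays isolated and is never sampled as a neighbor, which is one plausible way such a graph could separate relevant from irrelevant documents.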