UNREAL: Unlabeled Nodes Retrieval and Labeling for Heavily-imbalanced Node Classification

Divin Yan; Shengzhong Zhang; Bisheng Li; Min Zhou; Zengfeng Huang

Published: 01 Feb 2023, Last Modified: 05 Dec 2023

Submitted to ICLR 2023

Readers: Everyone

Keywords: Node Classification, Heavily-imbalanced Representation Learning, Graph Neural Networks

TL;DR: A method for retrieving unlabeled node information to handle heavily-imbalanced node classification

Abstract: Extremely skewed label distributions are common in real-world node classification tasks. If not handled appropriately, they significantly hurt the performance of GNNs on minority classes. Because of the practical importance of this problem, a series of recent studies has been devoted to it. Existing over-sampling techniques smooth the label distribution by generating "fake" minority nodes and synthesizing their features and local topology, largely ignoring the rich information carried by unlabeled nodes in the graph. Recent methods based on loss function modification re-weight samples or change classification margins, and achieve good performance. However, representative methods of this kind need label information to estimate the distance of each node to its class center, and that information is unavailable for unlabeled nodes. In this paper, we propose UNREAL, an iterative over-sampling method. The first key difference from prior work is that we add only unlabeled nodes rather than synthetic nodes, which eliminates the challenge of feature and neighborhood generation. To select which unlabeled nodes to add, we propose geometric ranking, which ranks unlabeled nodes based on unsupervised learning results in the node embedding space. Finally, we identify the issue of geometric imbalance in the embedding space and provide a simple metric to filter out geometrically imbalanced nodes. Extensive experiments on real-world benchmark datasets show that our method significantly and consistently outperforms current state-of-the-art methods across datasets and imbalance ratios.
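
The selection step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `select_unlabeled_nodes`, its parameters, the k-means initialization from labeled class means, and the ratio used as a "geometric imbalance" score are all assumptions made for illustration, since the abstract does not specify them. The sketch clusters node embeddings, ranks unlabeled nodes by distance to their nearest cluster center, and drops nodes that sit almost equally close to two centers.

```python
import numpy as np

def select_unlabeled_nodes(emb, labeled_idx, labels, num_classes,
                           per_class=2, gi_threshold=0.5, kmeans_iters=10):
    """Hypothetical selection round; needs num_classes >= 2 and at least
    one labeled node per class. Returns {class: [node indices]}."""
    emb = np.asarray(emb, dtype=float)
    labeled_idx = np.asarray(labeled_idx)
    labels = np.asarray(labels)
    unlabeled_idx = np.setdiff1d(np.arange(len(emb)), labeled_idx)

    # Assumption: initialize k-means at the labeled class means so that
    # cluster c corresponds to class c without a separate matching step.
    centers = np.stack([emb[labeled_idx[labels == c]].mean(axis=0)
                        for c in range(num_classes)])
    for _ in range(kmeans_iters):
        dists = np.linalg.norm(emb[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        for c in range(num_classes):
            members = emb[assign == c]
            if len(members) > 0:
                centers[c] = members.mean(axis=0)

    # Distances from every unlabeled node to every cluster center.
    dists = np.linalg.norm(emb[unlabeled_idx][:, None, :] - centers[None, :, :],
                           axis=2)
    nearest = dists.argmin(axis=1)
    sorted_d = np.sort(dists, axis=1)
    # Assumed imbalance score: nearest / second-nearest center distance.
    # Values close to 1 mean the node lies between two clusters.
    ratio = sorted_d[:, 0] / (sorted_d[:, 1] + 1e-12)

    selected = {}
    for c in range(num_classes):
        mask = (nearest == c) & (ratio < gi_threshold)
        candidates = unlabeled_idx[mask]
        # Rank surviving candidates: closest to the center first.
        ranked = candidates[np.argsort(dists[mask, c])]
        selected[c] = ranked[:per_class].tolist()
    return selected
```

In an iterative scheme of the kind the abstract describes, the returned nodes would be added to the training set with their assigned classes, the model retrained, and the selection repeated on the new embeddings; how many nodes to add per class in each round is left open here.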

Please Choose The Closest Area That Your Submission Falls Into: Deep Learning and representational learning

Supplementary Material: zip
