UniHG: A Large-scale Universal Heterogeneous Graph Dataset and Benchmark for Representation Learning and Cross-Domain Transferring

Yide Qiu; Tong Zhang; Shaoxiang Ling; Xing Cai; Ziqi Gu; Zhen Cui

Published: 18 Sept 2025, Last Modified: 30 Oct 2025. Venue: NeurIPS 2025 Datasets and Benchmarks Track (poster). License: CC BY 4.0

Keywords: Universal knowledge mining, Dataset construction, Graph neural networks.

TL;DR: Dataset and Method for the Large-scale Universal Heterogeneous Graph.

Abstract: Irregular data in the real world are usually organized as heterogeneous graphs consisting of multiple types of nodes and edges. However, current heterogeneous graph research confronts three fundamental challenges: i) Benchmark Deficiency, ii) Semantic Disalignment, and iii) Propagation Degradation. In this paper, we construct a large-scale, universal, joint multi-domain heterogeneous graph dataset named UniHG to facilitate heterogeneous graph representation learning and cross-domain knowledge mining. Overall, UniHG contains 77.31 million nodes and 564 million directed edges with thousands of labels and attributes, making it, to the best of our knowledge, the largest universal heterogeneous graph dataset currently available. To enable effective learning and provide comprehensive benchmarks on UniHG, we take two key measures: i) a semantic alignment strategy for multi-attribute entities, which projects the feature descriptions of multi-attribute nodes and edges into a common embedding space to facilitate information aggregation; and ii) a novel Heterogeneous Graph Decoupling (HGD) framework with a specifically designed Anisotropy Feature Propagation (AFP) module for learning effective multi-hop anisotropic propagation kernels. Together, these two strategies enable efficient information propagation among a tremendous number of multi-attribute entities while adaptively mining multi-attribute associations through multi-hop aggregation in large-scale heterogeneous graphs. Comprehensive benchmark results demonstrate that our model significantly outperforms existing methods, with an accuracy improvement of 28.93%. UniHG also facilitates downstream tasks, achieving NDCG@20 improvement rates of 11.48% and 11.71%. The UniHG dataset and benchmark code have been released at https://anonymous.4open.science/r/UniHG-AA78.

Croissant File: json

Dataset URL: https://huggingface.co/datasets/llooss/UniHG

Code URL: https://github.com/Yide-Qiu/UniHG

Supplementary Material: zip

Primary Area: Datasets & Benchmarks illustrating Different Deep learning Scenarios (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)

Submission Number: 2414
