A Large-Scale Database for Graph Representation Learning

08 Jun 2021, 00:42 (edited 06 Nov 2021) | NeurIPS 2021 Datasets and Benchmarks Track (Round 1) | Readers: Everyone
  • Keywords: graph representation learning, graph classification, dataset, database, graphs
  • TL;DR: A large-scale graph representation learning database offering over 1.2 million graphs, averaging 15k nodes and 35k edges per graph
  • Abstract: With the rapid emergence of graph representation learning, the construction of new large-scale datasets is necessary to distinguish model capabilities and accurately assess the strengths and weaknesses of each technique. By carefully analyzing existing graph databases, we identify 3 critical components important for advancing the field of graph representation learning: (1) large graphs, (2) many graphs, and (3) class diversity. To date, no single graph database offers all of these desired properties. We introduce MalNet, the largest public graph database ever constructed, representing a large-scale ontology of malicious software function call graphs. MalNet contains over 1.2 million graphs, averaging over 15k nodes and 35k edges per graph, across a hierarchy of 47 types and 696 families. Compared to the popular REDDIT-12K database, MalNet offers 105x more graphs, 44x larger graphs on average, and 63x more classes. We provide a detailed analysis of MalNet, discussing its properties and provenance, along with an evaluation of state-of-the-art machine learning and graph neural network techniques. The unprecedented scale and diversity of MalNet offer exciting opportunities to advance the frontiers of graph representation learning, enabling new discoveries and research into imbalanced classification, explainability, and the impact of class hardness. The database is publicly available at www.mal-net.org.
  • Supplementary Material: zip
  • URL: https://mal-net.org/