Keywords: graph machine learning, tabular machine learning, graph neural network, gradient boosting, benchmark
TL;DR: We address the gap between tabular and graph machine learning by collecting a new benchmark of meaningful tabular datasets with known graph structure and introducing strong and simple baselines previously overlooked by the research community.
Abstract: Tabular machine learning is an important field for industry and science. In this field, table rows are typically treated as independent data samples, but additional information about the relations between these samples is sometimes available and can be used to improve predictive performance. Such information can be naturally modeled with a graph, hence tabular machine learning may benefit from graph machine learning methods. However, graph machine learning models are typically evaluated on datasets with homogeneous, most often text-based node features, which are very different from the heterogeneous mixtures of numerical and categorical features present in tabular datasets. Thus, there is a critical difference between the data used in tabular and graph machine learning studies, which makes it hard to judge how successfully graph models can be transferred to tabular data. To bridge this gap, we propose a new benchmark of diverse graphs with heterogeneous tabular node features and realistic prediction tasks. We use this benchmark to evaluate a broad set of models, including simple methods previously overlooked in the literature. Our experiments show that graph neural networks can indeed often bring gains in predictive performance for tabular data, but standard tabular models can also be adapted to work with graph data by using simple graph-based feature augmentation, which sometimes enables them to compete with and even outperform graph neural models. Based on our empirical study, we provide insights for researchers and practitioners in both tabular and graph machine learning fields.
Submission Number: 41
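The abstract mentions adapting standard tabular models to graph data through simple graph-based feature augmentation. As a rough illustration only, the sketch below shows one common form of this idea: appending the mean of each node's neighbors' features to the node's own feature row, so that a tabular model such as gradient boosting can be trained on the widened table. The function name `augment_with_neighbor_means`, the undirected-edge convention, and the zero fill for isolated nodes are assumptions made for this example, not details taken from the paper.

```python
def augment_with_neighbor_means(features, edges):
    """Append per-node neighbor feature means to each feature row.

    features: list of equal-length numeric rows, one row per node.
    edges: iterable of (u, v) node-index pairs, treated as undirected
           (an assumption of this sketch).
    """
    n = len(features)
    dim = len(features[0]) if n else 0
    sums = [[0.0] * dim for _ in range(n)]
    degree = [0] * n

    # Accumulate neighbor features in both directions of every edge.
    for u, v in edges:
        for j in range(dim):
            sums[u][j] += features[v][j]
            sums[v][j] += features[u][j]
        degree[u] += 1
        degree[v] += 1

    augmented = []
    for i in range(n):
        # Isolated nodes keep zeros as their aggregated features.
        d = max(degree[i], 1)
        means = [s / d for s in sums[i]]
        augmented.append(list(features[i]) + means)
    return augmented


# Toy path graph 0 - 1 - 2 with two numerical features per node.
rows = augment_with_neighbor_means(
    [[1.0, 0.0], [3.0, 1.0], [5.0, 0.0]],
    [(0, 1), (1, 2)],
)
```

Each returned row has twice the original width; the second half summarizes the node's neighborhood. Categorical features would need to be encoded numerically (for example, one-hot) before this kind of averaging is meaningful.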