Flaky Performances when Pre-Training on Relational Databases with a Plan for Future Characterization Efforts
Keywords: self-supervised learning, graph neural networks, relational databases, tabular data
TL;DR: Using unlabeled data (SSL pre-training of a GNN) can hurt RDB classification performance; we have a hypothesis and a plan to test it.
Abstract: We explore the downstream task performance of graph neural network (GNN) self-supervised learning (SSL) methods trained on subgraphs extracted from relational databases (RDBs). Intuitively, this joint use of SSL and GNNs allows us to leverage more of the available data, which could translate to better results. However, while we observe positive transfer in some cases, others show systematic performance degradation, some of it severe. We hypothesize a mechanism that could explain this behaviour and draft a plan for future work to test it by characterizing how much relevant information different strategies can (theoretically and/or empirically) extract from (synthetic and/or real) RDBs.