SyntheRela: A Benchmark For Synthetic Relational Database Generation

Published: 04 Mar 2025, Last Modified: 17 Apr 2025 · ICLR 2025 Workshop SynthData · CC BY 4.0
Keywords: relational database, benchmark, synthetic data, data generation, empirical comparison, graph neural networks
TL;DR: We introduce SyntheRela, an open-source benchmarking tool with novel evaluation metrics for relational database synthesis, comparing 6 methods across 8 real-world databases.
Abstract: Synthesizing relational databases has started to receive more attention from researchers, practitioners, and industry. The task is more difficult than synthesizing a single table because of the added complexity of relationships between tables. For the same reason, benchmarking methods for synthesizing relational databases introduces new challenges. Our work is motivated by the lack of an empirical evaluation of state-of-the-art methods and by gaps in the understanding of how such an evaluation should be done. We review related work on relational database synthesis, common benchmarking datasets, and approaches to measuring the fidelity and utility of synthetic data. We combine established best practices with two novel contributions — a robust detection metric and relational deep learning utility, an approach that evaluates utility with graph neural networks — into a single benchmarking tool. We use it to compare six open-source methods across eight real-world databases with a total of 39 tables. The open-source SyntheRela benchmark is available on GitHub, alongside a public leaderboard.
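The abstract mentions a detection metric for fidelity. A common formulation of this idea — not necessarily SyntheRela's exact implementation — trains a discriminator to distinguish real from synthetic rows: accuracy near 0.5 means the synthetic data is hard to tell apart, accuracy near 1.0 means it is trivially detectable. A minimal, self-contained sketch using a leave-one-out 1-nearest-neighbour discriminator (all names here are illustrative):

```python
import random

def detection_score(real, synthetic):
    """Leave-one-out 1-NN discriminator accuracy over real vs. synthetic rows.

    `real` and `synthetic` are lists of equal-length numeric feature vectors.
    Returns accuracy in [0, 1]: ~0.5 means indistinguishable, ~1.0 means
    the synthetic data is trivially detectable.
    """
    data = [(row, 0) for row in real] + [(row, 1) for row in synthetic]
    correct = 0
    for i, (row, label) in enumerate(data):
        # Classify each row by the label of its nearest neighbour
        # among all other rows (squared Euclidean distance).
        nearest = min(
            (j for j in range(len(data)) if j != i),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(row, data[j][0])),
        )
        correct += data[nearest][1] == label
    return correct / len(data)

random.seed(0)
real = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(50)]
good = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(50)]  # same distribution
bad = [[random.gauss(5, 1), random.gauss(5, 1)] for _ in range(50)]   # shifted distribution

print(detection_score(real, good))  # near 0.5: hard to detect
print(detection_score(real, bad))   # near 1.0: easy to detect
```

In practice, benchmark tools typically use stronger discriminators (e.g. gradient-boosted trees) and, for relational data, features aggregated across child tables, which is what makes the relational setting harder than single-table detection.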
Submission Number: 79
