SchemaDB: A Dataset for Structures in Relational Data

Cody James Christopher, Kristen Moore, David Liebowitz

Published: 2022, Last Modified: 28 Sept 2024, Venue: AusDM 2022, License: CC BY-SA 4.0

Abstract: In this paper we introduce SchemaDB, a dataset of relational database schemas in both SQL and graph formats. Databases are rarely shared publicly for reasons of privacy and security, so their schemas are often unavailable for study. Consequently, an understanding of database structures in the wild is lacking, and the schemas that are easy to find publicly mostly belong to common development frameworks or are derived from textbooks or engine benchmarks. SchemaDB contains 2,500 samples of relational schemas found in public code repositories, which have been standardised to MySQL syntax. We describe our gathering and transformation methodology, present summary statistics and a structural analysis, and discuss potential downstream research tasks in several domains.