A Survey of Structured Data Foundation Models: A Unified View on Foundation Models for Tables, Relational Databases and Knowledge Graphs
Abstract: Foundation models for text, images, video, or robot actions are trained on massive amounts of data to respond to (human) prompts and to sample answers from learned distributions. We survey foundation models that retrieve or predict answers from structured data. We use \emph{structured data} as a unifying term for data in tables, relational databases, or knowledge graphs. Structured data contains values and the relations among them; it may come with a schema and knowledge descriptions. By analogy with other foundation models, a foundation model for structured data should be trained on large amounts of found and/or synthetic structured data (a dataset $D_1$), and it should be ``prompted'' with a query $q$ that is executed on a structured dataset $D_2$, which may or may not overlap with $D_1$. The foundation model should retrieve and/or predict distributions over the values, relations, or schema and knowledge descriptions that the query $q$ asks for, regardless of whether these are in $D_1 \cup D_2$ or not. Foundation models for tables, relational databases, and knowledge graphs have been explored in recent years and great progress has been achieved, but on closer inspection these models do not fully cover the task just defined. No existing structured data foundation model retrieves and predicts distributions for values, relations, and schema or knowledge descriptions.
Our unified view and formalization of structured data foundation models provides a yardstick for measuring progress on such models, and we apply it in a survey of major paradigms.
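To make the task interface in the abstract concrete, the following is a minimal toy sketch, not a method from the survey. It assumes rows are plain dictionaries, and the names `Query`, `ToyStructuredModel`, and `answer` are hypothetical. "Training" on $D_1$ is reduced to counting column values; a query $q$ is executed against $D_2$, and the model falls back to the frequencies learned from $D_1$ when $D_2$ holds no matching row. In both cases the result is a distribution over values.

```python
from collections import Counter
from dataclasses import dataclass

Row = dict[str, str]  # assumption: one record = column name -> value


@dataclass
class Query:
    """Hypothetical query q: return the `target` column of rows matching `where`."""
    target: str
    where: dict[str, str]


class ToyStructuredModel:
    """Illustrative stand-in for a model trained on D_1 (frequency counts only)."""

    def __init__(self, d1: list[Row]):
        # "Training": remember how often each value occurs per column in D_1.
        self.prior: dict[str, Counter] = {}
        for row in d1:
            for col, val in row.items():
                self.prior.setdefault(col, Counter())[val] += 1

    def answer(self, q: Query, d2: list[Row]) -> dict[str, float]:
        # Retrieval: target values of rows in D_2 that satisfy the query.
        hits = [
            r[q.target]
            for r in d2
            if q.target in r and all(r.get(k) == v for k, v in q.where.items())
        ]
        # Prediction: if nothing is retrieved, use the frequencies from D_1.
        counts = Counter(hits) if hits else self.prior.get(q.target, Counter())
        total = sum(counts.values())
        return {v: c / total for v, c in counts.items()} if total else {}


d1 = [
    {"city": "Paris", "country": "France"},
    {"city": "Lyon", "country": "France"},
    {"city": "Rome", "country": "Italy"},
]
d2 = [{"city": "Paris", "country": "France"}]

model = ToyStructuredModel(d1)
retrieved = model.answer(Query("country", {"city": "Paris"}), d2)  # found in D_2
predicted = model.answer(Query("country", {"city": "Nice"}), d2)   # not in D_2
```

Here `retrieved` comes from a matching row in $D_2$, while `predicted` is a distribution for a value that appears in neither dataset. A real foundation model would replace the frequency table with learned representations and would also cover relations and schema or knowledge descriptions, which this sketch omits.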