Schema Lineage Extraction at Scale: Multilingual Pipelines, Composite Evaluation, and Language-Model Benchmarks

Published: 24 Sept 2025, Last Modified: 18 Oct 2025
Venue: NeurIPS 2025 LLM Evaluation Workshop (Poster)
License: CC BY 4.0
Keywords: schema lineage extraction, multilingual code analysis, large language models, small language models, semantic drift, composite evaluation metrics, chain-of-thought prompting, code understanding
TL;DR: We automate schema lineage extraction from multilingual enterprise pipelines using language models. Our SLiCE metric and 1,700-sample benchmark show that a 32B open-source model can match GPT-series performance.
Abstract: Enterprise data pipelines, characterized by complex transformations across multiple programming languages, create a semantic disconnect between original metadata and transformed data. This "semantic drift" compromises data governance and impairs retrieval-augmented generation (RAG) and text-to-SQL systems. We propose a novel framework for automated schema lineage extraction from multilingual enterprise scripts, capturing four essential components: source schemas, source tables, transformation logic, and aggregation operations. We introduce Schema Lineage Composite Evaluation (SLiCE), a comprehensive metric assessing structural correctness and semantic fidelity, and present a benchmark of 1,700 manually annotated lineages from real-world industrial scripts. Evaluating 12 language models (1.3B–32B parameters, including GPT-4o/4.1), we demonstrate that extraction performance scales with model size and prompting sophistication. Notably, a 32B open-source model with chain-of-thought prompting achieves performance comparable to GPT-series models, enabling cost-effective deployment of schema-aware agents while maintaining rigorous data governance and enhancing downstream AI applications.
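The abstract names the four lineage components and a composite metric balancing structural correctness against semantic fidelity, but does not reproduce the SLiCE formulation. The sketch below is therefore purely illustrative: the `SchemaLineage` dataclass, the `slice_score` function, the equal component weights, and the token-F1 stand-in for semantic similarity are all assumptions, not the paper's implementation.

```python
from dataclasses import dataclass


@dataclass
class SchemaLineage:
    """The four lineage components extracted per pipeline output (per the abstract)."""
    source_schema: list[str]  # upstream column names
    source_table: str         # upstream table identifier
    transformation: str       # transformation logic, e.g. a SQL/Spark expression
    aggregation: str          # aggregation operation ("" if none)


def _token_f1(pred: str, gold: str) -> float:
    """Token-level F1 as a simple stand-in for semantic fidelity (assumption:
    the actual SLiCE metric may use a different text-similarity measure)."""
    p, g = set(pred.lower().split()), set(gold.lower().split())
    if not p or not g:
        return float(p == g)
    overlap = len(p & g)
    if overlap == 0:
        return 0.0
    prec, rec = overlap / len(p), overlap / len(g)
    return 2 * prec * rec / (prec + rec)


def slice_score(
    pred: SchemaLineage,
    gold: SchemaLineage,
    weights: tuple[float, float, float, float] = (0.25, 0.25, 0.25, 0.25),
) -> float:
    """Hypothetical composite score over the four components (equal weights assumed)."""
    # Structural correctness: Jaccard overlap of schema columns, exact table match.
    p_cols, g_cols = set(pred.source_schema), set(gold.source_schema)
    schema_match = len(p_cols & g_cols) / max(len(p_cols | g_cols), 1)
    table_match = float(pred.source_table == gold.source_table)
    # Semantic fidelity: soft similarity on the free-text components.
    transform_sim = _token_f1(pred.transformation, gold.transformation)
    agg_sim = _token_f1(pred.aggregation, gold.aggregation)
    w = weights
    return w[0] * schema_match + w[1] * table_match + w[2] * transform_sim + w[3] * agg_sim


if __name__ == "__main__":
    gold = SchemaLineage(["user_id", "amount"], "sales.orders",
                         "SUM(amount) GROUP BY user_id", "SUM")
    pred = SchemaLineage(["user_id", "amount"], "sales.orders",
                         "sum of amount per user_id", "SUM")
    print(round(slice_score(pred, gold), 3))
```

Splitting the score this way, exact or set-overlap matching for the structural fields (tables, schema columns) and soft text similarity for the free-text fields (transformation, aggregation), mirrors the abstract's distinction between structural correctness and semantic fidelity.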
Submission Number: 3