TabMeta: Table Metadata Generation with LLM-Curated Dataset and LLM-Judges

ACL ARR 2024 June Submission3211 Authors

15 Jun 2024 (modified: 03 Jul 2024), ACL ARR 2024 June Submission, CC BY 4.0
Abstract: Recent advances in LLMs have found use in several table-related tasks, including Text2SQL, data wrangling, imputation, and Q&A. Crucially, however, researchers have often overlooked the fact that downstream data consumers are often decoupled from data producers. Downstream data users therefore neither know precisely which tables to request access to and use, nor can they easily understand the complex, cryptic terminology (in column names, etc.) employed by the data producers. Specifically, the lack of descriptive metadata for tables has emerged as a significant obstacle to effective data governance and utilization. To tackle this, our work introduces TabMeta, a new natural language task aimed at automatically generating comprehensive metadata for arbitrarily complex tables, enabling non-expert users to discover, understand, and use relevant data more effectively. First, we curate a unique benchmark dataset for the TabMeta task, consisting of table descriptions and column descriptions for 302 tables spanning 30 industry domains. Second, we propose two novel tabular metadata evaluation strategies: (a) a robust and consistent LLM-Judge-based framework that aligns with human judgement and employs confidence scores suited to tabular metadata, and (b) ML-based metrics that capture the quality of the generated metadata, such as conciseness, coherence, and information gain. Finally, we show that our metadata enhancement framework substantially improves the performance of tabular data discovery and search, by a factor of 3-4.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: automatic evaluation, benchmarking, automatic evaluation of datasets
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: English
Submission Number: 3211