TableDreamer: Progressive and Weakness-guided Data Synthesis from Scratch for Table Instruction Tuning

ACL ARR 2025 February Submission 7629 Authors

16 Feb 2025 (modified: 09 May 2025) · License: CC BY 4.0
Abstract: Despite the commendable progress of recent LLM-based data synthesis methods, they face two limitations in generating table instruction tuning data. First, they cannot thoroughly explore the vast input space of table understanding tasks, leading to limited data diversity. Second, they ignore the weaknesses in the table understanding ability of the target LLM and blindly pursue an increase in data quantity, resulting in suboptimal data efficiency. In this paper, we introduce TableDreamer, a progressive and weakness-guided data synthesis framework tailored for table instruction tuning, to mitigate the above issues. Specifically, we first synthesize diverse tables and related instructions as seed data, and then perform an iterative exploration of the input space under the guidance of newly identified weakness data, which eventually serves as the final training data for fine-tuning the target LLM. Extensive experiments on 10 tabular benchmarks demonstrate the effectiveness of the proposed framework, which boosts the average accuracy of Llama3.1-8B-Instruct by 11.62\% ($49.07 \rightarrow 60.69$) with 27K GPT-4o synthetic examples and outperforms state-of-the-art data synthesis baselines that use more training data.
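For intuition, the progressive weakness-guided loop described in the abstract can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: every helper here (synthesize_seed_data, is_correct, expand_around) is a hypothetical placeholder standing in for LLM-driven synthesis and evaluation steps.

```python
# Minimal sketch of a progressive, weakness-guided data synthesis loop.
# All helpers are hypothetical stand-ins, not the paper's actual code.
from dataclasses import dataclass


@dataclass
class Example:
    table: str        # serialized table
    instruction: str  # task instruction over the table
    answer: str       # reference answer


def synthesize_seed_data(n: int) -> list[Example]:
    """Placeholder: prompt a strong LLM (e.g. GPT-4o) for diverse tables and instructions."""
    return [Example(f"table_{i}", f"instruction_{i}", f"answer_{i}") for i in range(n)]


def is_correct(model, ex: Example) -> bool:
    """Placeholder: check the target LLM's output against the reference answer."""
    return hash(ex.instruction) % 2 == 0  # stand-in for a real correctness check


def expand_around(weak: list[Example], k: int) -> list[Example]:
    """Placeholder: synthesize k new variants near each identified weakness example."""
    return [Example(w.table, f"{w.instruction}_variant{j}", w.answer)
            for w in weak for j in range(k)]


def tabledreamer_loop(model, seed_size: int = 1000, rounds: int = 3, k: int = 2) -> list[Example]:
    """Collect weakness data over several rounds of input-space exploration."""
    pool = synthesize_seed_data(seed_size)
    training_data: list[Example] = []
    for _ in range(rounds):
        # Weakness data: examples the target LLM currently answers incorrectly.
        weak = [ex for ex in pool if not is_correct(model, ex)]
        training_data.extend(weak)
        # Explore the input space around the newly identified weaknesses.
        pool = expand_around(weak, k)
    return training_data  # used to fine-tune the target LLM
```

The design choice the sketch highlights is that each round's candidate pool is generated from the previous round's failures, so synthesis budget concentrates on regions of the input space where the target LLM is weak rather than on uniformly more data.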
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: LLM-based data synthesis, table instruction tuning, table understanding, table QA, automatic creation and evaluation of language resources
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Publicly available software and/or pre-trained models, Data resources
Languages Studied: English
Submission Number: 7629