Towards Fully-Automated Dataset Construction

ACL ARR 2025 May Submission90 Authors

07 May 2025 (modified: 03 Jul 2025), ACL ARR 2025 May Submission, CC BY 4.0
Abstract: Applying advanced large language models to data annotation and synthesis has largely automated the process of dataset construction, yet the participation of human experts remains unavoidable. This paper proposes an approach to fully automated dataset construction. Given only minimal information, high-quality datasets can be constructed fully automatically for various tasks. Using the constructed datasets for both supervised fine-tuning and few-shot learning consistently improves performance. Furthermore, the first mathematical formalization of the dataset construction process is presented, providing the theoretical foundation for the proposed method.
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: data augmentation, NLP in resource-constrained settings
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Reproduction study, Approaches to low-resource settings, Approaches low compute settings-efficiency, Theory
Languages Studied: English, Chinese
Submission Number: 90