DATASEA - AN AUTOMATIC FRAMEWORK FOR COMPREHENSIVE DATASET PROCESSING USING LARGE LANGUAGE MODELS

Yang Xiang; Xinye Yang; Yanghao Wu

28 Sept 2024 (modified: 01 Dec 2024) | ICLR 2025 Conference Withdrawn Submission | CC BY 4.0

Keywords: Automated Data Processing, LLM, Data Pipeline Automation, NLP, Data Mining

TL;DR: A fully automatic system leveraging large language models to streamline dataset acquisition, metadata extraction, and preliminary analysis, enhancing research efficiency and data exploration.

Abstract: In the era of data-driven decision-making, efficiently acquiring and analyzing diverse datasets is critical for accelerating research and innovation. Yet, traditional manual approaches to dataset discovery, preparation, and exploration remain inefficient and cumbersome, especially as the scale and complexity of datasets continue to expand. These challenges create major roadblocks, slowing down the pace of progress and reducing the capacity for data-driven breakthroughs. To address these challenges, we introduce DataSEA (Search, Evaluate, Analyze), a fully automated system for comprehensive dataset processing, leveraging large language models (LLMs) to streamline the data handling pipeline. DataSEA autonomously searches for dataset sources, retrieves and organizes evaluation metadata, and generates custom scripts to load and analyze data based on user input. Users can provide just a dataset name, and DataSEA will handle the entire preparation process. While fully automated, minimal user interaction can further enhance system accuracy and dataset handling specificity. We evaluated DataSEA on datasets from distinct fields, demonstrating its robustness and efficiency in reducing the time and effort required for data preparation and exploration. By automating these foundational tasks, DataSEA empowers researchers to allocate more time to in-depth analysis and hypothesis generation, ultimately accelerating the pace of innovation. The code is available at https://github.com/SingleView11/DataSEA.
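The three stages named in the abstract (Search, Evaluate, Analyze) can be pictured as a simple chained pipeline. The following is only a minimal sketch under stated assumptions, not the DataSEA implementation: every name in it (`DatasetReport`, `search`, `evaluate`, `analyze`, `run_pipeline`) is hypothetical, and the LLM is modeled as any callable that maps a prompt string to a reply string.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable

# Hypothetical LLM backend: takes a prompt, returns the model's text reply.
LLM = Callable[[str], str]


@dataclass
class DatasetReport:
    """Illustrative container for what the three stages produce."""
    name: str
    sources: list[str] = field(default_factory=list)
    metadata: dict[str, str] = field(default_factory=dict)
    loader_script: str = ""


def search(name: str, llm: LLM) -> list[str]:
    """Search stage: ask the model for candidate sources, one per line."""
    reply = llm(f"List candidate source URLs for the dataset '{name}', one per line.")
    return [line.strip() for line in reply.splitlines() if line.strip()]


def evaluate(name: str, sources: list[str], llm: LLM) -> dict[str, str]:
    """Evaluate stage: parse 'key: value' lines into a metadata dict."""
    reply = llm(f"Summarize metadata for '{name}' from {sources} as 'key: value' lines.")
    metadata: dict[str, str] = {}
    for line in reply.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that carry no 'key: value' separator
            metadata[key.strip()] = value.strip()
    return metadata


def analyze(name: str, metadata: dict[str, str], llm: LLM) -> str:
    """Analyze stage: request a loading/analysis script as plain text."""
    return llm(f"Write a Python script that loads '{name}' given metadata {metadata}.")


def run_pipeline(name: str, llm: LLM) -> DatasetReport:
    """Chain the three stages, starting from only a dataset name."""
    sources = search(name, llm)
    metadata = evaluate(name, sources, llm)
    script = analyze(name, metadata, llm)
    return DatasetReport(name, sources, metadata, script)
```

Passing the model in as a plain callable keeps the sketch independent of any particular LLM provider and lets a canned stub stand in for the model when testing the control flow.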

Supplementary Material: zip

Primary Area: infrastructure, software libraries, hardware, systems, etc.

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.

Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Submission Number: 13220
