Keywords: Data Centric, LLM Agents, Survey Paper, Data Science
Abstract: Large Language Model (LLM)-based agents are increasingly employed to automate data science workflows, from preprocessing and modeling to interpretation and decision-making. While recent surveys have explored their technical designs, little attention has been paid to how these agents adapt to datasets with different structures and domain requirements. This survey adopts a dataset-centric perspective, categorizing LLM-based data agents by the types of data they are designed for and evaluated on. We analyze key design choices—such as planning strategies, self-correction mechanisms, multi-agent collaboration, and tool integration—through the lens of structured, semi-structured, and unstructured data, as well as domain-specific applications. We introduce a hierarchical taxonomy that connects agents’ capabilities in data management and analysis to their dataset contexts. Our analysis highlights current gaps in benchmark diversity and generalization, offering insights into the practical limitations of existing agents. We conclude by outlining future directions for designing and evaluating LLM-based data agents that are robust, adaptable, and dataset-aware.
Submission Number: 13